Intermittent site slowing
Incident Report for 17hats
Postmortem

For redundancy, 17hats uses multiple database servers. This morning one of the backup servers encountered a problem, which slowed down the main database. This slowdown caused a traffic jam on our web servers, which made the site unreachable. Because of the load on the backup server, it took us longer than desired to turn it off, at which point things partially stabilized. Once all of the web servers had restarted, the site returned to normal.

So how are we going to ensure this doesn’t happen again? The good news is that we are about two weeks away from updating our entire server architecture to the latest and greatest that Amazon has to offer. This new architecture has a much better database structure, and this problem should not recur (that said, this problem in theory should not have occurred to begin with, which we will certainly be investigating fully).

In addition, we recently put in place a new 24x7 global server operations team that is laser-focused on getting our site reliability up to the highest levels possible. Our business, like your business, relies on it. Many unseen back-end improvements have already been made, and the team helped get today’s situation under control faster than ever.

We know this impacted your day and your experience, and we will continue to work to regain your confidence.

Posted Apr 25, 2018 - 12:00 PDT

Resolved
Issue resolved; site operating at full capacity. Postmortem to follow.
Posted Apr 25, 2018 - 11:34 PDT
Monitoring
Issue identified. Site is currently operational for most or all users, and we are continuing to monitor for consistent peak performance.
Posted Apr 25, 2018 - 11:18 PDT
Investigating
Site is very slow for a subset of users and intermittently unavailable. Investigation in progress. Updates will be forthcoming.
Posted Apr 25, 2018 - 10:21 PDT