For redundancy, 17hats uses multiple database servers. This morning one of the backup servers encountered a problem, which slowed down the main database. This slowdown caused a traffic jam on our web servers, which made the site unreachable. Because of the load on the backup server, it took us longer than we would have liked to turn it off, at which point things began to stabilize. Once all of the web servers had restarted, the site returned to normal.
So how are we going to ensure this doesn’t happen again? The good news is that we are about two weeks away from updating our entire server architecture to the latest and greatest that Amazon has to offer. This new architecture has a much better database structure, and this problem should not recur. That said, in theory this problem should not have occurred to begin with, and we will be investigating it fully.
In addition, we recently brought on a new 24x7 global server operations team that is laser-focused on getting our site reliability up to the highest levels possible. Our business, like your business, relies on it. Many unseen back-end improvements have already been made, and the team helped get today’s situation under control faster than ever.
We know this impacted your day and your experience, and we will continue to work to regain your confidence.