As a follow-up to yesterday’s incident, here is a more detailed description of what happened.
Last year, Google asked us to undergo an in-depth security assessment and audit, due to our heavy usage of the Gmail and Google Calendar APIs. We hired the well-known security firm Bishop Fox to handle this audit. This included heavy penetration testing (yes, that’s what it is called), meaning they were actively looking for security issues. We are happy to report that we passed this audit.
As part of this audit, Google recommended (but didn’t require) that we look into using HackerOne. This is a community of “white hat” hackers, meaning they will actively try to find security issues. Because we care deeply about security, and are only human, this is a good way to help the community find potential issues. This means that 17hats has been under constant friendly fire since last year. Other than occasionally stressing our servers (which allowed us to optimize certain areas), there have been no issues with this. Until yesterday.
Around 12:35 p.m. PST, we ourselves started receiving password reset notices. Like you, we were puzzled. They were legitimate emails too, but we certainly didn’t make the requests. The application servers were relatively quiet, and it doesn’t make sense for somebody trying to hack into 17hats to send out these notices (you don’t want to tip people off!). Then we received a notice from HackerOne that somebody was trying to break the login page using fake data. That gave us a clue.
Digging deeper, we discovered that this person had sent a test request using empty objects. That should not be an issue, and typically isn’t. Except these empty data objects caused the database query to match all logins. Even that should not have been an issue if the code had simply sent an email to the first account. But (and this was the real bug) the code, expecting only one result, looped over all results and sent an email to each record.
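To make the failure mode concrete, here is a minimal sketch in Python. The names and data structures are purely illustrative, not 17hats’ actual code. It shows how a lookup built from an empty criteria object can vacuously match every account, and how a loop that assumed a single match then emails everyone:

```python
# Hypothetical sketch of the password-reset lookup bug (illustrative only).

def find_accounts(db, criteria):
    """Return accounts whose fields match every key in `criteria`.
    With an empty criteria dict, `all(...)` over zero conditions is
    vacuously True, so EVERY account matches."""
    return [acct for acct in db
            if all(acct.get(k) == v for k, v in criteria.items())]

def send_reset_emails(db, criteria, outbox):
    # Bug: the code assumed a single match but looped over all results,
    # appending one reset email per matched record.
    for acct in find_accounts(db, criteria):
        outbox.append(acct["email"])

def send_reset_email_fixed(db, criteria, outbox):
    # Fix: refuse ambiguous lookups; email only when exactly one
    # account matches the request.
    matches = find_accounts(db, criteria)
    if len(matches) == 1:
        outbox.append(matches[0]["email"])

db = [{"email": "a@example.com"}, {"email": "b@example.com"}]

buggy = []
send_reset_emails(db, {}, buggy)        # empty request object
print(buggy)   # ['a@example.com', 'b@example.com'] — everyone gets an email

fixed = []
send_reset_email_fixed(db, {}, fixed)
print(fixed)   # [] — ambiguous request, no email sent
```

The same vacuous-match behavior shows up in many query builders: an empty filter usually means “no constraints,” which is exactly why reset flows should demand an unambiguous, single-record match.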
This flood of sends overloaded the process, and the process stopped. Ironically, one of the suggestions from the security audit was to make sending password reset emails a back-end process.* In general, our back-end processes self-heal and try again. In this case, that meant resending the emails until the process stopped, then retrying, and so on.
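Here is a small sketch (again illustrative, not the real code) of why a self-healing retry amplified the problem: if the job restarts from the beginning and nothing records which recipients were already handled, each retry re-sends the emails that went out before the crash:

```python
# Illustrative sketch of a non-idempotent retry loop re-sending emails.

def send_batch(recipients, outbox, fail_after):
    """Send emails one by one; raise after `fail_after` successful
    sends, simulating the process dying under load."""
    for i, addr in enumerate(recipients):
        if i == fail_after:
            raise RuntimeError("process overloaded")
        outbox.append(addr)

def run_with_retries(recipients, outbox, fail_after, max_attempts):
    for _ in range(max_attempts):
        try:
            send_batch(recipients, outbox, fail_after)
            return
        except RuntimeError:
            continue  # "self-heal": retry the whole batch from scratch

outbox = []
run_with_retries(["u1", "u2", "u3"], outbox, fail_after=2, max_attempts=3)
print(len(outbox))  # 6 — each retry re-sent the first two emails
```

The usual remedy is to make the job idempotent, e.g. by checkpointing progress or deduplicating per recipient, so a retry picks up where the last attempt left off instead of starting over.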
The downside of fast servers is that when something like this happens, it goes wrong fast. By the time we had diagnosed the issue, written a fix, and updated the servers (about an hour after we received the first email), hundreds of thousands of emails had already gone out. What made matters worse is that our email host, SendGrid, understandably queued the emails, so many hours after the issue was resolved, people were still receiving them, prompting further “what is going on?!” reactions.
We completely understand the frustration of receiving multiple emails, and having the fear that your account is being compromised. We can assure you this was not the case. It is because of our commitment to security that we joined HackerOne, and we look forward to working with its community to ensure your data is safe.
Thank you for your understanding, and your patience yesterday!
*The reason for using a back-end process was so that hackers could not use the timing of a request to determine whether they had entered an email address that exists in our system versus one that does not. A back-end process means the work is handed off to a different server via a queue, so the application server’s response time is unaffected. Had this remained on our application servers, it would in all likelihood have failed immediately. It was this “improvement” that exposed the bug described above.
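The timing rationale above can be sketched as follows. This is a toy model with made-up names, not 17hats’ implementation: a synchronous handler does extra work only for real addresses, so response time leaks whether an account exists, whereas an enqueue-and-return handler does identical work for any input:

```python
# Hedged sketch of the timing-leak rationale (illustrative names only).
from collections import deque

KNOWN = {"real@example.com"}   # stand-in for the accounts table
queue = deque()                # stand-in for the back-end job queue

def reset_sync(email):
    # Synchronous: sending only happens for real accounts, so the
    # extra work (and therefore latency) reveals account existence.
    work = 0
    if email in KNOWN:
        work = 1000  # stand-in for "render and send an email now"
    return "If that address exists, a reset link was sent.", work

def reset_async(email):
    # Back-end version: always enqueue; a worker decides later whether
    # to actually send. The handler's work is identical for any input.
    queue.append(email)
    return "If that address exists, a reset link was sent."

print(reset_sync("real@example.com")[1])   # 1000 — measurably slower
print(reset_sync("fake@example.com")[1])   # 0
print(reset_async("real@example.com") == reset_async("fake@example.com"))  # True
```

Combined with a neutral response message, the asynchronous hand-off keeps an attacker from enumerating valid addresses by timing the reset endpoint.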