US-EAST-1 Outage

Incident Report for Status for Fathom Analytics

Resolved

Amazon Web Services' US-EAST-1 region was offline for around 1.5 hours today. The outage was completely out of our hands, and pageviews were lost. We were unable to do anything during the outage and had to wait for Amazon to fix things, because our us-east-1 infrastructure is the core of everything we do (our EU isolation proxy is separate but, after it removes personal data, it still sends requests to us-east-1).

We run Fathom from multiple availability zones, and we pay a premium to do that because we care about availability. Outside of a DDoS attack back in 2020, we've never seen downtime like this, where everything (even the pageview collection) was taken down. In this scenario, despite us having multiple availability zones set up, the entire region's Lambda (our compute) collapsed.

We've also identified inadequacies in our status page: we need to move towards automating updates the minute something goes offline, so you know that we're aware. In addition, this has brought service availability (something we obsess about) to the top of our priority list, and we will be making changes.

We had planned to move Fathom's ingest to multiple regions later this year, and we've now bumped up the priority there. We're so sorry for the outage and we'll continue to invest in making our service even more resilient.

Posted Jun 13, 2023 - 09:44 PDT