**Hightouch's Encounter with AWS Aurora Race Condition: Lessons from a Database Upgrade Ordeal**
In the fast-paced world of cloud-based data systems, reliability and scalability are paramount. Hightouch, a company specializing in event data collection and synchronization, recently faced a series of unexpected technical challenges while navigating the aftermath of a major AWS outage and a subsequent infrastructure upgrade. Their experience not only revealed a rare bug within AWS Aurora but also offered valuable insights into the complexities of cloud infrastructure operations.
**Background: The October 20th AWS Outage**
On October 20, a significant AWS outage affected the us-east-1 region. The root cause was a race condition bug in one of AWS’s DNS management services. For Hightouch, whose Events product relies heavily on AWS services, this outage led to a massive processing backlog. Their system, which collects and centralizes behavioral data such as page views, clicks, and purchases for clients, was suddenly stretched to its limits. While the underlying Kafka infrastructure ensured that no events were lost, the sheer volume of accumulated events threatened to overwhelm their existing database resources, particularly a small Postgres instance used as a virtual queue for batch metadata.
**Architecture and Scaling Challenges**
Hightouch’s architecture is designed for scalability along three main axes: Kubernetes clusters for event collection and batch processing, Kafka for streaming and durable event handling, and Postgres (hosted on Amazon Aurora) for managing queue metadata. Historically, Postgres queues had served them well, handling millions of syncs per day and scaling up to 500,000 events per second with low latency.
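Hightouch has not published its schema, but the "Postgres as a queue" pattern described here is commonly implemented with `SELECT ... FOR UPDATE SKIP LOCKED`, which lets many workers claim batch rows concurrently without blocking one another. A minimal sketch in TypeScript (table and column names are illustrative, not Hightouch's actual code):

```typescript
// Sketch of a Postgres-backed batch queue (schema names are illustrative).
// Workers claim the oldest pending batch with FOR UPDATE SKIP LOCKED so that
// concurrent pollers never block each other or grab the same row.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

interface BatchRow {
  id: string;
  s3_key: string;
}

async function claimNextBatch(): Promise<BatchRow | null> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    const { rows } = await client.query<BatchRow>(
      `SELECT id, s3_key
         FROM batch_queue
        WHERE status = 'pending'
        ORDER BY created_at
        LIMIT 1
          FOR UPDATE SKIP LOCKED`
    );
    if (rows.length === 0) {
      await client.query("ROLLBACK");
      return null;
    }
    await client.query(
      `UPDATE batch_queue SET status = 'processing', claimed_at = now() WHERE id = $1`,
      [rows[0].id]
    );
    await client.query("COMMIT");
    return rows[0];
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```

The appeal of this design is that batch state lives in the same transactional store as everything else; the trade-off, as the backlog showed, is that a single small instance becomes the choke point when volume spikes.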
However, the extraordinary backlog created by the AWS outage exposed the limitations of their current setup. To handle such spikes in workload and to provide more "headroom" for future processing needs, the team decided it was necessary to upgrade their Aurora database instance.
**The Aurora Upgrade Plan**
Amazon Aurora, AWS’s managed database service, offers a different architecture from standard PostgreSQL by separating compute and storage. This setup allows for rapid failovers and efficient scaling. In Aurora, a failover operation involves promoting a read replica to the role of primary writer—a process that can be triggered automatically in case of failure or manually during upgrades and maintenance.
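For reference, a manual failover of this kind can be initiated from the console, the CLI, or the SDK. A minimal sketch using the AWS SDK for JavaScript v3 (cluster and instance identifiers below are placeholders, not Hightouch's resources):

```typescript
// Sketch: trigger a manual Aurora failover, promoting a chosen reader to writer.
import { RDSClient, FailoverDBClusterCommand } from "@aws-sdk/client-rds";

const rds = new RDSClient({ region: "us-east-1" });

async function promoteReplica(): Promise<void> {
  await rds.send(
    new FailoverDBClusterCommand({
      DBClusterIdentifier: "events-aurora-cluster",         // placeholder cluster name
      TargetDBInstanceIdentifier: "events-aurora-upgraded", // the replica to promote
    })
  );
}
```

The call returns as soon as the failover has been requested; the actual promotion happens asynchronously, which is why operators typically watch the cluster's status afterward to confirm the writer really changed.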
With confidence gained from successful staging tests, Hightouch planned and initiated the Aurora upgrade on October 23. The process was supposed to be seamless: promote the upgraded instance to writer, let the old primary become a replica, and keep the system running without downtime.
**Unexpected Failover Failures**
At 4:39 PM EDT on October 23, the team triggered the failover. Initially, everything seemed normal—the AWS Console reflected the expected status changes. But upon completion, the original writer instance was still primary, and the failover appeared to have reversed itself. All indicators in AWS showed the cluster as healthy, yet Hightouch’s backend services were unable to execute write queries. Restarting services cleared the errors temporarily, but the upgrade itself had not taken effect.
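One way to surface this kind of mismatch is to compare what the RDS control plane reports as the writer against what the application's own connection sees. A hedged sketch of that check (not Hightouch's actual tooling; identifiers are placeholders):

```typescript
// Sketch: cross-check the RDS API's view of the writer against the connection's
// own read-only status. A mismatch (cluster reported healthy, connection
// read-only) matches the symptom described above.
import { Pool } from "pg";
import { RDSClient, DescribeDBClustersCommand } from "@aws-sdk/client-rds";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const rds = new RDSClient({ region: "us-east-1" });

async function checkWriterConsistency(): Promise<void> {
  // What the RDS control plane believes: which cluster member is the writer?
  const { DBClusters } = await rds.send(
    new DescribeDBClustersCommand({ DBClusterIdentifier: "events-aurora-cluster" })
  );
  const apiWriter = DBClusters?.[0]?.DBClusterMembers?.find(
    (m) => m.IsClusterWriter
  )?.DBInstanceIdentifier;

  // What the data plane says: does this connection accept writes?
  const { rows } = await pool.query("SHOW transaction_read_only");

  console.log({ apiWriter, transactionReadOnly: rows[0].transaction_read_only });
}
```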
