A Case Study in Zero-Downtime Migration
The team at Argos detailed its complex, zero-downtime migration of a 300GB PostgreSQL database from Heroku to AWS. The project, involving tables with 250 million rows, highlights key patterns for backend engineers, including continuous replication for live cutover and the critical role of extensive observability and rollback plans.
The migration was driven by the need for greater control and cost-efficiency as Argos scaled. Heroku's managed Postgres environment limited the team's ability to fine-tune the database, control replication directly, and scale storage independently of compute, which meant paying for capacity they did not use.
A major technical hurdle was Heroku's restriction on direct logical replication. To work around it, the Argos team coordinated with Heroku support to have low-level Write-Ahead Log (WAL) archives exposed to an S3 bucket, and then built their replication pipeline manually from those archives.
Their strategy was a multi-phase approach that used an intermediate EC2 instance as a bridge. They first restored a backup to this EC2 server and replayed the WAL files until it caught up with the live Heroku database. The initial cutover pointed the application at the EC2 instance, with minimal downtime. After that switch, the EC2 server acted as the new primary, and the team set up logical replication from it to the final destination: a managed AWS RDS instance. Once RDS was fully synchronized, they performed the final switch by updating the application's database URL to point to RDS, then decommissioned the intermediary server.
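The case study does not say how the team confirmed that a replica had caught up before each switch, so the following is only a minimal Python sketch of one way such a check could look. It assumes the log sequence numbers (LSNs) have already been read from PostgreSQL, for example with `pg_current_wal_lsn()` on the source and `pg_last_wal_replay_lsn()` on a WAL-replaying standby. The function names, the publication and subscription names, and the connection string are all hypothetical and are not taken from Argos.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers: the high 32 bits,
    a slash, then the low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def replication_lag_bytes(source_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica has not yet applied (never negative)."""
    return max(0, lsn_to_int(source_lsn) - lsn_to_int(replica_lsn))


def ready_for_cutover(source_lsn: str, replica_lsn: str,
                      max_lag_bytes: int = 0) -> bool:
    """True when the replica is within max_lag_bytes of the source.

    Assumption: writes to the source are paused while this is checked,
    otherwise the source LSN keeps moving.
    """
    return replication_lag_bytes(source_lsn, replica_lsn) <= max_lag_bytes


def logical_replication_sql(publication: str, subscription: str,
                            source_conninfo: str) -> dict:
    """Build the statements for the EC2-to-RDS logical replication phase.

    CREATE PUBLICATION runs on the source (the EC2 primary) and
    CREATE SUBSCRIPTION runs on the target (RDS). All names are
    placeholders supplied by the caller.
    """
    return {
        "source": f"CREATE PUBLICATION {publication} FOR ALL TABLES;",
        "target": (
            f"CREATE SUBSCRIPTION {subscription} "
            f"CONNECTION '{source_conninfo}' "
            f"PUBLICATION {publication};"
        ),
    }
```

For example, `ready_for_cutover("0/3000060", "0/3000000")` returns `False` because the replica is still 96 bytes behind, while passing `max_lag_bytes=128` would accept it. A real cutover would pair a check like this with the observability and rollback plans the article stresses, since a byte-level lag of zero says nothing about application health.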