
Database Migration Strategies That Won't Destroy Your Product

Most database migrations fail not because the SQL is wrong, but because the deployment strategy is wrong. Here is how to change your schema in production without taking your app offline or corrupting data.

Nate Laquis, Founder & CEO

Why Database Migrations Break Production

A database migration is not just a SQL file. It is a live operation on a running system that real users are hitting right now. The failure modes are ugly: table locks that freeze all writes for 30 seconds, column renames that break running app instances, data transformations that time out halfway through and leave your table in an inconsistent state.

These are not hypothetical disasters. They happen to teams that apply migrations the way they learned in tutorial land: stop the app, run the migration, restart the app. That works fine in development. In production it means downtime, data loss, and a very bad morning.

The root cause of most migration failures is treating schema changes as atomic when they are not. Your app has multiple instances running. Your database has millions of rows. Your migration script runs in seconds but the data transformation runs for hours. None of these realities exist in local development, which is exactly why local testing gives teams false confidence.


This guide is for teams running PostgreSQL in production, though most of the patterns apply equally to MySQL and MariaDB. We will cover the specific DDL operations that cause table locks, the backward-compatibility rule that makes zero-downtime deploys possible, and the exact playbook for common schema changes like adding columns, removing columns, renaming things, and migrating large datasets.

Table Locks: The Hidden Danger in Simple Migrations

PostgreSQL acquires an AccessExclusiveLock for most DDL operations. This lock blocks every other operation on the table: reads, writes, and even other lock acquisitions. While the lock is held, any query touching that table waits in a queue. If the migration takes 5 seconds, you have a 5-second window where your app is effectively offline for that table.

On a small table with 10,000 rows, most operations complete in milliseconds and the lock is barely noticeable. On a table with 5 million rows, the same operation might hold the lock for 30 seconds or more. On a table with 50 million rows, you are looking at minutes.

Operations That Acquire Exclusive Locks in PostgreSQL

  • ADD COLUMN with a DEFAULT value (before PostgreSQL 11): Adding a column with a non-null default required rewriting the entire table row by row. PostgreSQL 11 fixed this for most cases, but if you are on an older version or using a DEFAULT that is not a constant, you still get a full table rewrite.
  • SET NOT NULL on an existing column: PostgreSQL must verify that every row satisfies the constraint, which requires a full table scan under an exclusive lock. (ADD COLUMN NOT NULL without a default is worse still: it simply fails on a non-empty table, because existing rows would violate the constraint.)
  • DROP COLUMN: Acquires an exclusive lock and marks the column as dropped in the catalog. Fast in terms of lock duration, but the column's storage is not reclaimed until the table is later rewritten (for example by VACUUM FULL).
  • CREATE INDEX (non-concurrent): Blocks all writes for the entire index build. On a 10M-row table, that could be 10 to 60 minutes.
  • ALTER COLUMN TYPE: Full table rewrite. One of the most dangerous operations in production.
  • RENAME TABLE or RENAME COLUMN: Fast operation but breaks any running app instance that references the old name.

The lock wait itself creates a secondary problem. If your migration is waiting to acquire its lock (because the table is busy), every subsequent query against that table queues up behind it. You can have a situation where one slow migration queues 500 application queries, all of which time out after 30 seconds, creating a cascade of failures that looks like a complete outage.

The lock_timeout Setting

Always set lock_timeout before running migrations in production. This causes the migration to fail immediately if it cannot acquire the lock within the specified time, rather than queueing indefinitely. A good default: SET lock_timeout = '5s'. If your migration cannot get the lock in 5 seconds, it aborts cleanly. Your app stays up, and you can retry during a lower-traffic window.
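A minimal sketch of the pattern, reusing the add-column example from later in this guide. SET LOCAL scopes both timeouts to the migration's transaction so they do not leak into the rest of the session:

```sql
BEGIN;
SET LOCAL lock_timeout = '5s';        -- abort if the lock is not acquired within 5s
SET LOCAL statement_timeout = '30s';  -- abort if any single statement runs too long
ALTER TABLE users ADD COLUMN middle_name VARCHAR(100);
COMMIT;
```

If the ALTER cannot get its lock in 5 seconds, the whole transaction rolls back cleanly and nothing is half-applied.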

The Golden Rule: Backward-Compatible Changes

The single most important principle in zero-downtime database migrations is this: your database schema must be compatible with the version of the app that is currently running AND the version that is about to be deployed, at the same time.

During a rolling deployment, you have both versions running simultaneously. The old version and the new version are both reading and writing the same database. Any schema change that breaks the old version will cause errors the moment you start deploying. Any schema change that the new version depends on before it exists will cause errors immediately after deployment.

Changes That Are Always Safe

  • Adding a nullable column with no default: Old app instances ignore the column. New app instances can read and write it. Safe in a single migration.
  • Adding a new table: Old instances do not know it exists. No conflicts.
  • Adding a new index: Transparent to the application. Use CREATE INDEX CONCURRENTLY to avoid locks.
  • Widening a column type: Changing VARCHAR(100) to VARCHAR(255) does not affect existing data or queries.

Changes That Require Multiple Deployment Phases

  • Renaming a column: Phase 1: add the new column. Phase 2: deploy app that writes to both columns. Phase 3: backfill old column data into new column. Phase 4: deploy app that only reads from new column. Phase 5: drop old column. This is 5 separate deployments for a column rename.
  • Removing a column: Phase 1: deploy app that stops reading from the column. Phase 2: run the migration to drop the column. Reversing this order breaks the app immediately.
  • Adding a NOT NULL constraint to an existing column: The column may have null values in existing rows. You must backfill first, then add the constraint.
  • Changing a column type: Requires a full column rename/replace cycle or careful use of triggers to dual-write during transition.
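As a sketch, the rename phases map to SQL like this. The column names (a hypothetical users.name being renamed to full_name) are illustrative, and the Phase 3 backfill should be batched on a large table:

```sql
-- Phase 1: add the new column (safe: nullable, no default)
ALTER TABLE users ADD COLUMN full_name VARCHAR(200);

-- Phase 2 is an app deploy: new code writes both name and full_name

-- Phase 3: backfill the new column (batch this on a large table)
UPDATE users SET full_name = name WHERE full_name IS NULL;

-- Phase 4 is an app deploy: new code reads only full_name

-- Phase 5: once the Phase 4 deploy is stable, drop the old column
ALTER TABLE users DROP COLUMN name;
```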

The multi-phase approach feels slow. It is. A simple column rename that would take 30 seconds in development takes 3 days in production if you are doing rolling deploys carefully. That is the correct tradeoff. The alternative is a deployment that corrupts data or takes the app down.

Zero-Downtime Patterns in PostgreSQL

Here are the specific techniques for the most common schema operations, with the exact SQL and the reasoning behind each step.

Adding a Column Safely

The simple case: adding a nullable column. This is fast and safe on any PostgreSQL version.

ALTER TABLE users ADD COLUMN middle_name VARCHAR(100);

The dangerous case: adding a column with NOT NULL and no default. Never do this on a large table in production. Instead:

  • Add the column as nullable: ALTER TABLE users ADD COLUMN email_verified BOOLEAN;
  • Backfill existing rows in batches: UPDATE users SET email_verified = false WHERE id BETWEEN 1 AND 10000 AND email_verified IS NULL; (repeat for all rows)
  • Add the constraint using VALIDATE CONSTRAINT (described below).

CREATE INDEX CONCURRENTLY

Standard CREATE INDEX holds an exclusive lock for the entire index build. CREATE INDEX CONCURRENTLY builds the index without blocking reads or writes. It takes roughly twice as long because it scans the table multiple times, but it does not interrupt your application.

CREATE INDEX CONCURRENTLY idx_users_email ON users(email);

Caveats: you cannot run CONCURRENTLY inside a transaction block. If the index build fails, it leaves behind an invalid index that you need to manually drop and retry. Always check for invalid indexes after migrations: SELECT indexrelid::regclass FROM pg_index WHERE NOT indisvalid.
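If a concurrent build does fail, the recovery is to drop the invalid leftover without blocking writes and retry the build, using the index from the example above:

```sql
DROP INDEX CONCURRENTLY IF EXISTS idx_users_email;
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
```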

Adding NOT NULL Constraints Safely

The naive approach (ALTER COLUMN ... SET NOT NULL) locks the table while PostgreSQL validates every row. A safer pattern uses a CHECK constraint, which PostgreSQL 12+ can then reuse to skip that validation:

  • Step 1: Add a NOT VALID check constraint. This validates only new rows, not existing rows, so it acquires a minimal lock: ALTER TABLE orders ADD CONSTRAINT orders_status_not_null CHECK (status IS NOT NULL) NOT VALID;
  • Step 2: Backfill existing null rows with correct values.
  • Step 3: Validate the constraint. This acquires a ShareUpdateExclusiveLock (which does not block reads or writes) and validates all existing rows: ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null;
  • Step 4: Once validated, you can optionally set the column NOT NULL. PostgreSQL is smart enough to skip re-validating if the check constraint already exists and is valid.
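The four steps above as one sequence, using the column from the example. The 'pending' backfill value is an assumption, and Step 2 should be batched on a large table:

```sql
-- Step 1: constrain new rows only; the lock is brief
ALTER TABLE orders
  ADD CONSTRAINT orders_status_not_null CHECK (status IS NOT NULL) NOT VALID;

-- Step 2: backfill existing nulls (batch this on a large table)
UPDATE orders SET status = 'pending' WHERE status IS NULL;

-- Step 3: validate without blocking reads or writes
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null;

-- Step 4: promote to a real NOT NULL; PostgreSQL 12+ skips the re-scan
-- because the validated check constraint already proves it
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_status_not_null;  -- now redundant
```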

Dropping Columns

Dropping a column is fast in PostgreSQL (it just marks the column as dropped in the system catalog), but you must make sure no running code references the column first. The safe order: deploy the code change that stops using the column, then run the migration. Never the other way around.

Data Migrations: Moving Millions of Rows Safely

Schema migrations change the structure of your tables. Data migrations transform the data inside them. Both can cause production problems, but data migrations have an additional failure mode: they can run for hours, lock rows, and fail halfway through, leaving your data in an inconsistent state.

The Batched Update Pattern

Never run a single UPDATE on a large table. UPDATE orders SET status = 'pending' WHERE status IS NULL will lock every matched row for the entire duration of the query. On 5 million rows, that is a very long time.

Instead, batch the update in chunks, processing 1,000 to 10,000 rows at a time with a short pause between batches:

DO $$
DECLARE
  batch_count BIGINT;
BEGIN
  LOOP
    UPDATE orders SET status = 'pending'
    WHERE id IN (
      SELECT id FROM orders WHERE status IS NULL ORDER BY id LIMIT 5000
    );
    GET DIAGNOSTICS batch_count = ROW_COUNT;
    EXIT WHEN batch_count = 0;
    COMMIT;  -- release row locks between batches (PostgreSQL 11+)
    PERFORM pg_sleep(0.1);
  END LOOP;
END $$;

The 100ms pause between batches gives other queries a chance to run and prevents the migration from consuming 100% of database I/O. The whole thing takes longer, but it does not interrupt your application.

The Dual-Write Pattern

When you need to migrate data to a new structure (moving from a single address field to a structured address table, for example), the dual-write pattern allows you to do it without downtime:

  • Phase 1: Create the new table or column. Deploy code that writes to both the old and new structure on every insert and update.
  • Phase 2: Run a background job to backfill historical data from the old structure to the new structure. This can take hours or days. Your app continues running throughout.
  • Phase 3: Once backfill is complete and you have verified the new structure contains correct data, deploy code that reads from the new structure only but still writes to both (in case of rollback).
  • Phase 4: After confirming the new code is stable, deploy code that only writes to the new structure. Drop the old structure.

This pattern adds complexity but is the only reliable way to migrate large datasets without downtime or data loss risk.
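One way to implement the Phase 1 dual write at the database level, rather than in application code, is a trigger. This sketch assumes a hypothetical users.address text column being migrated into a new addresses table, and requires PostgreSQL 11+ for the EXECUTE FUNCTION syntax:

```sql
CREATE TABLE addresses (
  user_id BIGINT PRIMARY KEY REFERENCES users(id),
  line1   TEXT
);

CREATE OR REPLACE FUNCTION copy_address() RETURNS trigger AS $$
BEGIN
  -- Mirror every write to the old column into the new table
  INSERT INTO addresses (user_id, line1)
  VALUES (NEW.id, NEW.address)
  ON CONFLICT (user_id) DO UPDATE SET line1 = EXCLUDED.line1;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_address_dual_write
AFTER INSERT OR UPDATE OF address ON users
FOR EACH ROW EXECUTE FUNCTION copy_address();
```

The trigger approach keeps the dual write atomic with the original statement, at the cost of hiding the migration logic in the database. Drop the trigger in Phase 4.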

Handling Migration Failures Midway

Batched data migrations should be idempotent: running them twice produces the same result as running them once. If your migration is UPDATE orders SET normalized_status = LOWER(status) WHERE normalized_status IS NULL, running it twice is safe because the WHERE clause skips already-migrated rows. Design every data migration to be resumable and idempotent from the start.


Migration Tooling: Prisma, Drizzle, and Flyway

Most teams use an ORM or migration framework rather than writing raw SQL files. Each tool has different philosophies and different failure modes.

Prisma Migrate

Prisma generates SQL migration files from your schema.prisma definition. You define your schema in a TypeScript-friendly DSL, run npx prisma migrate dev to generate the migration file in development, and npx prisma migrate deploy to apply it in production.

The upside: Prisma tracks applied migrations in a _prisma_migrations table, warns about migration drift, and integrates cleanly with TypeScript type generation. The downside: Prisma generates naive SQL that does not account for zero-downtime concerns. A simple column rename in schema.prisma generates a DROP COLUMN and ADD COLUMN, which loses all existing data. You often need to edit the generated SQL file before running it.

For production use, always review the generated migration file before deploying. Run npx prisma migrate diff to preview the SQL before generating the file. Never run prisma migrate dev in production (it can reset migrations). Use prisma migrate deploy in your CI/CD pipeline.

Drizzle Kit

Drizzle is a newer ORM with a schema-first approach similar to Prisma but with explicit SQL control. drizzle-kit generate creates migration SQL files that you can edit freely. Drizzle does not try to be clever about your migrations, which means less magic and more control.

Drizzle is the better choice if your team wants fine-grained control over migration SQL and is comfortable writing and reviewing raw SQL. Prisma is better for teams that want more abstraction and are primarily working with straightforward schema changes.

Flyway

Flyway is a Java-based migration tool that works with any database and any application stack. You write plain SQL files named with a version prefix (V1__create_users.sql, V2__add_email_column.sql) and Flyway applies them in order, tracking applied migrations in a flyway_schema_history table.
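A versioned file is just plain SQL. A sketch of what V2__add_email_column.sql might contain (the lock_timeout line is a precaution from earlier in this guide, not something Flyway requires):

```sql
-- V2__add_email_column.sql
SET lock_timeout = '5s';
ALTER TABLE users ADD COLUMN email VARCHAR(255);
```

Note that Flyway runs each migration inside a transaction by default, so statements that cannot run in one, such as CREATE INDEX CONCURRENTLY, need to be configured to run outside it.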

Flyway gives you complete control over the SQL, supports repeatable migrations (for views and stored procedures), and has a large production track record. The downside is that it adds a Java dependency and requires more manual process than ORM-integrated tools.

For teams running PostgreSQL with Node.js, Prisma or Drizzle is usually the right choice. For multi-language stacks or teams with a strong SQL culture, Flyway is excellent. Avoid tools that auto-apply migrations on app startup (like Sequelize's sync) in production: you lose visibility and control over what is being applied.

Testing Migrations Before Production

The most dangerous migration is one that has never run against production-scale data. A migration that takes 50 milliseconds on your 10,000-row development database may take 45 minutes on your 50-million-row production table.

Use a Production Clone

The only reliable way to test migration performance is to run it against a copy of your production database with production-scale data. Most cloud providers make this straightforward. On AWS RDS, you can create a snapshot and restore it to a new instance in 15 to 30 minutes. On Supabase, branching creates a copy of your database for testing.

The process: create a snapshot of production, restore it to a staging database of the same instance size, run your migration, and measure how long it takes and whether it affects query performance. If it takes 30 seconds on the clone, expect 30 seconds in production under similar load (usually worse because production has concurrent queries).

Measure Lock Duration

While your test migration runs, run this query in a separate session to see what locks are being held:

SELECT pid, mode, relation::regclass, granted FROM pg_locks JOIN pg_stat_activity USING (pid) WHERE relation IS NOT NULL;

This shows you exactly which locks your migration acquires and how long it holds them. If you see AccessExclusiveLock on a table for more than a second, investigate whether a concurrent or batched approach is available.
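To see how long each lock-holding query has been running, the same join can be extended with timing columns from pg_stat_activity:

```sql
SELECT a.pid,
       l.mode,
       l.relation::regclass AS locked_table,
       l.granted,
       now() - a.query_start AS running_for
FROM pg_locks l
JOIN pg_stat_activity a USING (pid)
WHERE l.relation IS NOT NULL
ORDER BY running_for DESC NULLS LAST;
```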

Test the Rollback

Know exactly how you will undo the migration if something goes wrong. For schema changes, this means having a rollback SQL file ready. For data migrations, it means understanding whether the transformation is reversible. Some migrations are not reversible (you cannot un-delete data you dropped). Make sure you have a backup before running irreversible migrations.

Automate Migration Checks in CI

Tools like squawk (a PostgreSQL migration linter) can catch dangerous migration patterns before they reach production. Add squawk to your CI pipeline to automatically flag operations like ADD COLUMN NOT NULL without a default, non-concurrent index creation, and missing lock timeouts. This catches the obvious mistakes before a human review even happens.

Emergency Procedures: When a Migration Goes Wrong

Even with all the right precautions, migrations sometimes go wrong in production. Having a plan before things go wrong is what separates a 10-minute incident from a 4-hour outage.

If the Migration Is Hung

If your migration is blocking other queries, identify the locking process and decide whether to kill it:

  • Find blocking processes: SELECT pid, query, state, wait_event_type, wait_event FROM pg_stat_activity WHERE state != 'idle' ORDER BY wait_event_type;
  • Kill the migration if it is safe to abort: SELECT pg_terminate_backend(pid); where pid is the migration's process ID.
  • After terminating, the migration rolls back automatically. Your app recovers as soon as the lock is released.
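On PostgreSQL 9.6 and later, pg_blocking_pids() maps each blocked session directly to the sessions blocking it, which is quicker to read than the raw activity list:

```sql
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
JOIN pg_stat_activity AS blocking ON blocking.pid = b.pid;
```

The blocking_pid column is the candidate for pg_terminate_backend if you decide the migration must be killed.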

If Data Was Corrupted

This is the worst case. If a migration ran a bad UPDATE or DELETE, your first move is to stop all writes to the affected tables if possible. If you have point-in-time recovery enabled (you should), you can restore to a snapshot from before the migration ran. On AWS RDS, point-in-time recovery lets you restore to any second within your retention window. This is not fast (30 to 60 minutes for a restore) but it is how you recover from catastrophic data loss.

If the Rollback Itself Fails

Rollback SQL can also have problems, especially if the forward migration partially completed. The safest approach for complex migrations is a pre-migration backup at the database level (RDS snapshot, pg_dump for smaller databases) that you can restore independently of your migration tooling.

Post-Incident Process

After any migration incident, document what happened, what the migration was supposed to do, why it went wrong, and what change to your process would have prevented it. Migration incidents are almost always process failures (no production-scale testing, no lock timeout, no rollback plan) rather than fundamental mistakes. The fix is process, not just fixing the SQL.

If your team is dealing with a complex migration on a large production database and wants a second opinion before running it, we can review your migration plan and flag risks before they become incidents. Book a free strategy call and we will walk through your migration strategy together.


Tags: database migration, zero-downtime deployment, schema migration, Prisma migrations, database management
