
System Design: Data Reconciliation and Consistency in Migrations

#System Design #Data Migration #Data Reconciliation #Consistency #Distributed Systems

Data inconsistencies are almost inevitable in complex migrations. Learn about data reconciliation strategies to detect and repair discrepancies between old and new systems.


When performing a zero-downtime migration using patterns like Dual-Write, you are essentially maintaining two separate copies of your data for a period. Despite your best efforts, these two systems can and will drift apart. A network glitch, a bug in the new service, or a brief outage can cause a write to fail on one system, leading to data inconsistency.

Data Reconciliation is the process of systematically finding and fixing these discrepancies. It acts as a safety net, ensuring that when you finally cut over to the new system, you can be confident that its data is a complete and accurate reflection of the old system. Without it, you risk data loss, incorrect analytics, and a poor user experience.

Why is Reconciliation Necessary?

  • Partial Failures: In a dual-write setup, a write to the new system might fail while the write to the old one succeeds. Even with a Dead Letter Queue (DLQ), the retry mechanism itself can fail.
  • Bugs: The new application logic might have subtle bugs that cause data to be written incorrectly.
  • Human Error: Manual interventions or backfills can introduce errors.
  • Backfill Gaps: The process of migrating historical data (the "backfill") might miss records that were updated at the exact moment the backfill was running.

Reconciliation is your tool to audit the migration's success and build trust in the new system.

Key Reconciliation Strategies

Reconciliation isn't a single technique but a collection of strategies that can be used together.

1. Checksums and Hashing

This is a high-level approach to quickly verify large datasets.

  • How it works:

    1. Run a background job that calculates a checksum or hash of records in both the old and new databases.
    2. You can group records by a certain time window (e.g., all orders from 10:00 AM to 11:00 AM) or by a logical shard (e.g., all users in a specific region).
    3. Compare the checksums for each group. If they match, the data is likely consistent. If they don't, you've identified a "dirty" segment that needs a more detailed look.
  • Pros: Fast and efficient for identifying problem areas without comparing every single record.

  • Cons: Doesn't tell you which record is different, only that a difference exists within a group.

graph TD
    subgraph Old Database
        A[Record A1]
        B[Record A2]
        C[Record A3]
    end
    subgraph New Database
        D[Record B1]
        E[Record B2]
        F["Record B3 (Modified)"]
    end
    OldDB_Group((Group A))
    NewDB_Group((Group B))
    A & B & C --> OldDB_Group
    D & E & F --> NewDB_Group
    Compare{Checksums Match?}
    OldDB_Group -- "Checksum: 0xABC123" --> Compare
    NewDB_Group -- "Checksum: 0xDEF456" --> Compare
    Compare -- "No" --> Mismatch[Mismatch Detected!]
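To make this concrete, here is a minimal Go sketch of the grouping-and-hashing step. The Record shape, the group key, and the per-record encoding are illustrative assumptions, not a fixed schema; the point is that each side computes a deterministic digest per group, and only mismatched groups get a detailed look.

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Record is an illustrative row; a real schema will differ.
type Record struct {
	ID    int
	Group string // e.g. a time window or shard key
	Data  string
}

// GroupChecksums buckets records by group and hashes each bucket.
// Records are sorted by ID first so the digest is deterministic.
func GroupChecksums(records []Record) map[string]string {
	byGroup := map[string][]Record{}
	for _, r := range records {
		byGroup[r.Group] = append(byGroup[r.Group], r)
	}
	sums := map[string]string{}
	for group, rs := range byGroup {
		sort.Slice(rs, func(i, j int) bool { return rs[i].ID < rs[j].ID })
		h := sha256.New()
		for _, r := range rs {
			fmt.Fprintf(h, "%d|%s\n", r.ID, r.Data) // stable per-record encoding
		}
		sums[group] = hex.EncodeToString(h.Sum(nil))
	}
	return sums
}

func main() {
	oldRecords := []Record{{1, "10:00", "apple"}, {2, "10:00", "banana"}, {3, "11:00", "cherry"}}
	newRecords := []Record{{1, "10:00", "apple"}, {2, "10:00", "banana"}, {3, "11:00", "CHERRY"}}
	oldSums, newSums := GroupChecksums(oldRecords), GroupChecksums(newRecords)
	for group, sum := range oldSums {
		if newSums[group] != sum {
			fmt.Printf("group %s is dirty and needs a detailed compare\n", group)
		}
	}
}

Sorting before hashing matters: without a deterministic record order, two identical groups can produce different digests and trigger false alarms.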

2. Full Data Comparison (Shadow Compare)

This is a more granular, row-by-row comparison.

  • How it works:

    1. A background job reads records from the source of truth (the old database).
    2. For each record, it reads the corresponding record from the new database.
    3. It compares the two records field by field.
    4. If a discrepancy is found, it logs the difference and, depending on the strategy, can automatically repair the new system's data.
  • Pros: Provides exact details of the inconsistency.

  • Cons: Can be very resource-intensive, both in terms of database load and network traffic. It's often run during off-peak hours (a pagination sketch follows this list).
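Because the full scan is the expensive part, such jobs typically page through the source with keyset pagination instead of loading everything at once. Below is a minimal sketch under assumed Source and Destination interfaces; the in-memory implementations stand in for real stores.

package main

import (
	"fmt"
	"reflect"
)

// Row is an illustrative record type.
type Row struct {
	ID   int
	Data string
}

// Source yields rows in ID order, one page at a time (keyset pagination).
type Source interface {
	FetchPage(afterID, limit int) []Row
}

// Destination supports point lookups by primary key.
type Destination interface {
	Get(id int) (Row, bool)
}

// ShadowCompare pages through src and checks every row against dst,
// returning the IDs of rows that are missing or differ.
func ShadowCompare(src Source, dst Destination, pageSize int) []int {
	var dirty []int
	afterID := 0
	for {
		page := src.FetchPage(afterID, pageSize)
		if len(page) == 0 {
			return dirty
		}
		for _, want := range page {
			got, ok := dst.Get(want.ID)
			// reflect.DeepEqual stands in for a field-by-field compare.
			if !ok || !reflect.DeepEqual(want, got) {
				dirty = append(dirty, want.ID)
			}
			afterID = want.ID
		}
	}
}

// In-memory stand-ins so the sketch runs on its own.
type sliceSource []Row // assumed sorted by ID

func (s sliceSource) FetchPage(afterID, limit int) []Row {
	var out []Row
	for _, r := range s {
		if r.ID > afterID && len(out) < limit {
			out = append(out, r)
		}
	}
	return out
}

type mapDest map[int]Row

func (m mapDest) Get(id int) (Row, bool) { r, ok := m[id]; return r, ok }

func main() {
	src := sliceSource{{ID: 1, Data: "a"}, {ID: 2, Data: "b"}, {ID: 3, Data: "c"}}
	dst := mapDest{1: {ID: 1, Data: "a"}, 3: {ID: 3, Data: "x"}}
	fmt.Println("dirty IDs:", ShadowCompare(src, dst, 2)) // prints: dirty IDs: [2 3]
}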

3. Reconciliation Jobs

This is the process that acts on the discrepancies found by the methods above.

  • Read-Repair: When a discrepancy is found during a read operation, the system can fix it on the fly. For example, if a user requests data and the reconciliation logic finds it's missing from the new DB, it can copy it from the old DB before returning it (see the sketch after this list).
  • Batch Repair: A dedicated background worker reads a log of known inconsistencies (generated by a checksum or shadow compare process) and applies the necessary fixes to the new database. This is the most common approach.
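Read-repair is easiest to see in code. This is a minimal sketch reusing the same map-backed shapes as the full example in the next section; the GetWithReadRepair function is illustrative, not a prescribed API.

package main

import "log"

// Record and DB mirror the shapes used in the full example below.
type Record struct {
	ID   int
	Data string
}

type DB map[int]Record

var oldDB = DB{1: {ID: 1, Data: "apple"}, 2: {ID: 2, Data: "banana"}}
var newDB = DB{1: {ID: 1, Data: "apple"}} // record 2 never made it over

// GetWithReadRepair serves reads from the new system, falling back to
// the old one and repairing the gap in-line when a record is missing.
func GetWithReadRepair(id int) (Record, bool) {
	if r, ok := newDB[id]; ok {
		return r, true
	}
	r, ok := oldDB[id]
	if !ok {
		return Record{}, false // missing in both systems: a genuine miss
	}
	log.Printf("read-repair: copying record %d into the new database", id)
	newDB[id] = r // fix it on the fly
	return r, true
}

func main() {
	r, _ := GetWithReadRepair(2) // triggers the repair
	log.Printf("served %+v; new DB now holds %d records", r, len(newDB))
}

Read-repair keeps hot data consistent automatically, but it only fixes records that are actually read; cold records still need a batch pass.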

Go Example: A Simple Batch Reconciliation Script

This conceptual Go program demonstrates a batch reconciliation job. It reads records from a "source" (old DB) and a "destination" (new DB) and compares them. If a record is missing in the destination or the data doesn't match, it logs it for repair.

package main

import "log"

// Record represents a simple data record.
type Record struct {
	ID   int
	Data string
}

// DB represents a generic database.
type DB map[int]Record

// --- Simulate two databases ---

// SourceOfTruthDB is our old, reliable database.
var SourceOfTruthDB = DB{
	1: {ID: 1, Data: "apple"},
	2: {ID: 2, Data: "banana"},
	3: {ID: 3, Data: "cherry"}, // This will be missing in the new DB
	4: {ID: 4, Data: "date"},   // This will be modified in the new DB
}

// NewSystemDB is the database we are migrating to.
var NewSystemDB = DB{
	1: {ID: 1, Data: "apple"},
	2: {ID: 2, Data: "banana"},
	// Missing record with ID 3
	4: {ID: 4, Data: "durian"}, // Data is different
}

// ReconciliationJob runs the comparison logic.
func ReconciliationJob() {
	log.Println("Starting data reconciliation job...")
	
	// In a real system, you would iterate over keys or use a paginated query.
	for id, sourceRecord := range SourceOfTruthDB {
		
		// 1. Check if the record exists in the new database.
		destRecord, exists := NewSystemDB[id]
		if !exists {
			log.Printf("[MISMATCH] Record with ID %d is MISSING in the new database. Needs repair.", id)
			// Repair action: Copy the record
			// NewSystemDB[id] = sourceRecord
			continue
		}

		// 2. If it exists, compare the data.
		if sourceRecord.Data != destRecord.Data {
			log.Printf(
				"[MISMATCH] Record with ID %d has different data. Source: '%s', Destination: '%s'. Needs repair.",
				id, sourceRecord.Data, destRecord.Data,
			)
			// Repair action: Overwrite with source data
			// NewSystemDB[id] = sourceRecord
			continue
		}

		log.Printf("Record with ID %d is consistent.", id)
	}

	log.Println("Reconciliation job finished.")
}

func main() {
	ReconciliationJob()

	// After a hypothetical repair, the DBs would be in sync.
	// For example, if we uncomment the repair actions in the job:
	// NewSystemDB[3] = SourceOfTruthDB[3]
	// NewSystemDB[4] = SourceOfTruthDB[4]
	// log.Println("After repair, NewSystemDB:", NewSystemDB)
}

Handling Conflicts

What if a record has been updated in both the old and new systems in different ways? This is a classic data conflict. The right way to handle it depends on the business logic.

  • Source of Truth Wins: The simplest strategy. The data from the old system is considered correct, and the data in the new system is overwritten. This is the most common approach during a migration.
  • Last Write Wins (LWW): Use timestamps. The version of the record with the most recent timestamp is kept. This can be risky if clocks are not perfectly synchronized (see the sketch after this list).
  • Application-Specific Logic: For complex cases, you may need to write custom logic. For an e-commerce order, you might merge the two versions or flag the order for manual review by a human.
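As a minimal sketch of Last-Write-Wins, assume each record carries an UpdatedAt timestamp (an illustrative field the earlier examples don't have):

package main

import (
	"fmt"
	"time"
)

// VersionedRecord carries a timestamp so conflicting copies can be ordered.
type VersionedRecord struct {
	ID        int
	Data      string
	UpdatedAt time.Time
}

// ResolveLWW keeps whichever copy was written most recently. It is only
// as trustworthy as the clocks that produced the timestamps.
func ResolveLWW(a, b VersionedRecord) VersionedRecord {
	if b.UpdatedAt.After(a.UpdatedAt) {
		return b
	}
	return a
}

func main() {
	now := time.Now()
	fromOld := VersionedRecord{ID: 4, Data: "date", UpdatedAt: now.Add(-time.Minute)}
	fromNew := VersionedRecord{ID: 4, Data: "durian", UpdatedAt: now}
	winner := ResolveLWW(fromOld, fromNew)
	fmt.Printf("keeping %q (written at %s)\n", winner.Data, winner.UpdatedAt.Format(time.RFC3339))
}

In production, a monotonic version number or vector clock is often preferred over wall-clock time, precisely because of the clock-skew risk noted above.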

Conclusion

Data reconciliation is not an optional add-on for a migration; it's a core component of the process. It provides the necessary checks and balances to ensure that a complex, multi-step migration doesn't lead to silent data corruption. By implementing strategies like checksums and shadow comparisons, and having a clear plan for repairing inconsistencies, you can build confidence in your new system and ensure a smooth, successful cutover with no data loss.