system design

Architectures for Real-Time Data Processing

#system design#architecture pattern#stream processing#lambda architecture#kappa architecture#big data

A look at common architectural patterns for real-time data processing, including Lambda and Kappa architectures, and their trade-offs.

Introduction: The Need for Speed and Accuracy

As we've seen, stream processing allows us to analyze data with very low latency. However, building a system that is both fast and accurate, fault-tolerant, and capable of handling historical data is a significant architectural challenge.

Early real-time systems often faced a difficult trade-off:

  • Stream processing systems were fast but were considered less reliable and struggled with re-processing historical data or correcting errors.
  • Batch processing systems were reliable and could process huge volumes of data accurately, but they were slow.

To address this, two primary architectural patterns have emerged for designing end-to-end data processing systems: the Lambda Architecture and the Kappa Architecture.

The Lambda Architecture: A Hybrid Approach

The Lambda Architecture, proposed by Nathan Marz, combines batch and stream processing into a single, comprehensive system. The core idea is to get the best of both worlds: the low-latency of a real-time system and the accuracy and completeness of a batch system.

It consists of three layers:

  1. Batch Layer (Cold Path): This is the canonical source of truth. It stores the entire, immutable master dataset. Periodically, it runs batch jobs to re-compute a comprehensive set of "batch views" from all the data. This layer prioritizes accuracy and completeness over speed.
  2. Speed Layer (Hot Path): This layer processes data in real-time as it arrives. It sacrifices some accuracy for extremely low latency. It generates "real-time views" that are based only on recent data.
  3. Serving Layer: This layer responds to queries. To provide a complete answer, it merges results from both the batch views and the real-time views.
graph TD subgraph "Data In" Stream[Event Stream] end subgraph "Lambda Architecture" BatchLayer[Batch Layer<br/>Master Dataset] SpeedLayer[Speed Layer<br/>Real-Time Processing] ServingLayer[Serving Layer<br/>Merges Views] Stream --> BatchLayer Stream --> SpeedLayer BatchLayer -->|Batch Views| ServingLayer SpeedLayer -->|Real-Time Views| ServingLayer end subgraph "Queries" Query[Query] Result[Result] Query --> ServingLayer ServingLayer --> Result end

How it Works in Practice

Imagine you are building a real-time analytics dashboard for a website.

  • The Batch Layer might run a job every night to calculate the exact number of unique visitors for all previous days.
  • The Speed Layer processes today's clicks in real-time, using a fast but potentially less accurate algorithm (like HyperLogLog) to estimate the number of unique visitors for the current day.
  • When a user queries the dashboard, the Serving Layer fetches the exact numbers from the batch view (for yesterday and before) and merges them with the real-time estimate for today.

Pros and Cons of Lambda

  • Pros:
    • Extremely robust and fault-tolerant. Errors in the speed layer are temporary, as they will be corrected by the next batch run.
    • Supports a wide variety of use cases, from historical analysis to real-time alerting.
  • Cons:
    • Complexity: This is the biggest drawback. You have to build, maintain, and manage two separate systems with different codebases for processing the same data.

The Kappa Architecture: Simplifying with a Single Path

The Kappa Architecture, proposed by Jay Kreps, emerged as a simplification of the Lambda Architecture. It argues that if your stream processing system is good enough, you don't need a separate batch layer.

The core idea is to treat everything as a stream. The Kappa Architecture uses a single stream processing pipeline to handle both real-time data processing and historical re-processing.

How it Works

  1. Canonical Store: A scalable, durable event streaming platform (like Apache Kafka) is used as the canonical store. All data is ingested into this log, which can retain data for long periods (days, months, or even forever).
  2. Stream Processing Layer: A single stream processing framework (like Apache Flink or ksqlDB) reads from the event log to generate real-time views.
  3. Re-processing: If you need to re-compute your views (e.g., because of a bug fix in your code), you simply deploy a new version of your stream processor and have it re-read the data from the beginning of the event log.
graph TD subgraph "Data In" Stream[Event Stream] end subgraph "Kappa Architecture" Log[Canonical Event Log<br/>e.g. Kafka] StreamProcessor[Stream Processing Engine<br/>e.g. Flink] ServingViews[Serving Views] Stream --> Log Log -->|Reads from log| StreamProcessor StreamProcessor -->|Creates/Updates| ServingViews end subgraph "Queries" Query[Query] Query --> ServingViews end

Pros and Cons of Kappa

  • Pros:
    • Simplicity: You only need to maintain one codebase and one processing framework. This dramatically reduces operational overhead.
  • Cons:
    • Tooling Dependency: It relies heavily on having a stream processing system that is mature enough to handle both real-time processing and fast, large-scale re-processing from the log.
    • Re-processing large volumes of historical data can still be computationally expensive and time-consuming.

Lambda vs. Kappa: Which to Choose?

Feature Lambda Architecture Kappa Architecture
Complexity High (two separate codebases) Low (one codebase)
Core Idea Batch + Real-time Everything is a stream
Source of Truth Immutable Master Dataset (HDFS, S3) Replayable Event Log (Kafka)
Best For Systems where batch and real-time logic are fundamentally different. Systems where most logic can be expressed as a stream transformation.

Today, with the maturity of stream processing frameworks like Flink, Spark Streaming, and Kafka Streams, the Kappa Architecture is becoming the more popular choice for new projects. The simplicity of maintaining a single processing pipeline is a huge advantage.

However, the Lambda Architecture is still relevant, especially in large enterprises with existing, well-established batch processing systems (like Hadoop MapReduce or Spark batch jobs). It can serve as a practical bridge, allowing them to add real-time capabilities without replacing their entire batch infrastructure.

Go Example: Conceptual Kappa-Style Processor

This example simulates the core idea of a Kappa architecture: a single processor that builds a view from a stream and has the ability to "re-process" from the beginning.

package main

import (
	"fmt"
	"sync"
)

// Event represents a record in our log.
type Event struct {
	Key   string
	Value int
}

// EventLog simulates a durable, replayable log like Kafka.
type EventLog struct {
	events []Event
}

// View is the materialized result of our processing.
type View struct {
	mu   sync.Mutex
	data map[string]int
}

// StreamProcessor reads from the log and updates the view.
type StreamProcessor struct {
	log  *EventLog
	view *View
}

func (p *StreamProcessor) Process(event Event) {
	p.view.mu.Lock()
	defer p.view.mu.Unlock()
	// Simple aggregation: sum values by key
	p.view.data[event.Key] += event.Value
	fmt.Printf("Processed event for key '%s'. View value is now: %d\n", event.Key, p.view.data[event.Key])
}

// Replay re-processes all events from the beginning of the log.
func (p *StreamProcessor) Replay() {
	fmt.Println("\n--- Replaying all events ---")
	// Reset the view before replaying
	p.view.data = make(map[string]int)
	for _, event := range p.log.events {
		p.Process(event)
	}
	fmt.Println("--- Replay complete ---")
}

func main() {
	log := &EventLog{
		events: []Event{
			{Key: "A", Value: 10},
			{Key: "B", Value: 5},
			{Key: "A", Value: 2},
		},
	}
	view := &View{data: make(map[string]int)}
	processor := &StreamProcessor{log: log, view: view}

	// Initial processing
	fmt.Println("--- Initial Processing ---")
	for _, event := range log.events {
		processor.Process(event)
	}
	fmt.Printf("Initial View: %+v\n", processor.view.data)

	// Imagine we found a bug in our logic and deployed a new version.
	// We would now trigger a replay to rebuild the view correctly.
	processor.Replay()
	fmt.Printf("View after replay: %+v\n", processor.view.data)
}

Conclusion

Choosing the right data processing architecture is a critical design decision.

  • The Lambda Architecture offers a path to combining the reliability of batch processing with the speed of stream processing, at the cost of high complexity.
  • The Kappa Architecture offers a much simpler, unified model, but requires mature streaming tools that can handle re-processing.

As streaming technology continues to advance, the lines are blurring, but understanding these two foundational patterns provides a clear framework for thinking about how to build robust, scalable, and accurate data systems.