
CDC Pipelines Explained: Debezium, Kafka, and Warehouse MERGE Patterns
A CDC pipeline captures row changes in a source database, publishes those changes as events, and applies them to a warehouse table. Instead of reloading full tables, it moves only inserts, updates, and deletes. If you’re learning CDC as a data engineer, this is one of the clearest patterns to understand because it shows how modern platforms keep data fresh without overloading the source system.
The flow is simple once you see it end to end. Debezium reads database logs, Kafka carries the events, and the warehouse uses MERGE logic to keep target tables accurate. After that, CDC stops feeling abstract.
Key Points
- CDC reads database logs instead of scanning full tables on every run.
- Debezium turns row-level changes into structured events.
- Kafka buffers and delivers those events to downstream consumers.
- Warehouse MERGE logic decides whether each event inserts, updates, or deletes a row.
- Reliable CDC pipelines depend on keys, timestamps, deduping, and idempotent loads.
Quick takeaway: Kafka moves events, but warehouse logic decides what the final table should look like.
How change data capture works before any data moves
Full-table refreshes are easy to explain, but they don’t scale well. A batch job may reread millions of rows even when only a small slice changed. That burns compute, increases source load, and leaves reports stale for hours.
CDC takes a better route for many use cases. It reads the database’s transaction log, binlog, or write-ahead log and captures only committed changes. As a result, downstream systems receive smaller, faster updates, and the source database avoids repeated heavy queries.
A quick comparison makes the tradeoff clear.
| Approach | What moves | Source load | Freshness | Best fit |
| --- | --- | --- | --- | --- |
| Full load | Entire table | High | Slow | Small tables, simple jobs |
| Incremental query | Rows matched by timestamp | Medium | Moderate | Basic pipelines |
| CDC | Inserts, updates, deletes from logs | Low | Near real-time | Analytics, sync, event-driven loads |
CDC wins on freshness and efficiency, but it asks more from the pipeline design.
Why teams choose CDC over batch refreshes
Teams adopt CDC because it cuts repeated work. The source database does less scanning, the network carries fewer bytes, and downstream tables update faster. That matters for dashboards, customer activity models, reverse ETL, and system sync jobs.
There are tradeoffs, though. CDC adds moving parts, and those parts can disagree. A pipeline has to handle duplicate messages, late-arriving events, and deletes that many batch jobs ignore. If the source table has weak keys or missing timestamps, the loader cannot reliably match or order events, and the target table stops being trustworthy.
The basic path from database change to warehouse row
A row changes in the source database first. Debezium reads that committed change from the log and publishes a structured event to Kafka. Next, a consumer or loader reads the event, lands it in the warehouse, and runs a MERGE into the target table.
That last step is where raw events become usable data. The warehouse decides which event is current, whether a delete should remove a row, and how to prevent the same change from inflating counts twice.
Where Debezium fits in the CDC pipeline
Debezium is the capture layer. It connects to supported databases, reads their change logs, and emits row-level events in a standard format. That removes the need to poll tables every few minutes and guess what changed since the last run.
Because Debezium reads the same logs the database uses to commit transactions, it usually gives a more complete record than timestamp-based incremental loading. That’s especially useful when deletes matter, or when multiple updates hit the same row in a short window.
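To make the gap concrete, here is a small sketch that compares a timestamp-based poll with a change log over the same set of writes. Everything in it is hypothetical: the `Row` class, the integer clock in `updated_at`, and the in-memory `table` and `log` stand in for a real database and its transaction log.

```python
from dataclasses import dataclass


@dataclass
class Row:
    id: int
    status: str
    updated_at: int  # hypothetical integer clock, not a real timestamp


# In-memory stand-ins for a source table and its change log (assumptions).
table = {1: Row(1, "new", 1), 2: Row(2, "new", 1)}
log = []


def write(row):
    """Insert or update a row and record the change in the log."""
    table[row.id] = row
    log.append(("upsert", row.id, row.status))


def delete(row_id):
    """Delete a row and record the change in the log."""
    del table[row_id]
    log.append(("delete", row_id, None))


def poll_changes(source, last_seen):
    """Timestamp-based incremental query: rows changed after last_seen."""
    return [r for r in source.values() if r.updated_at > last_seen]


# Three changes happen between two polling runs.
write(Row(1, "paid", 2))
write(Row(1, "shipped", 3))
delete(2)

polled = poll_changes(table, last_seen=1)
```

The poll returns a single row, the final state of row 1. It never sees the intermediate `paid` update, and it cannot see the delete of row 2 because that row no longer exists to be queried. The log holds all three changes in order.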
What Debezium captures from source databases
Debezium captures inserts, updates, and deletes. It can also emit schema change events, depending on the connector and setup. Most change events include before and after values, an operation type, source metadata, and a timestamp.
That structure helps downstream consumers a lot. A warehouse loader doesn’t have to diff full rows or guess what happened. It can read the event and know whether the row is new, changed, or gone.
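As a rough illustration, a loader might normalize each event like the sketch below. It assumes the event arrives as JSON with the common Debezium envelope fields (`before`, `after`, `op`, `source`, `ts_ms`), optionally wrapped in a `payload` object. The `parse_change` function, the `id` key column, and the output field names are hypothetical, and the sketch does not handle tombstone messages or schema change events.

```python
import json

# Debezium operation codes: create, snapshot read, update, delete.
OPS = {"c": "insert", "r": "insert", "u": "update", "d": "delete"}


def parse_change(raw, key_field="id"):
    """Turn one JSON change event into a flat record for staging.

    key_field is an assumption: the name of the primary key column.
    """
    event = json.loads(raw)
    # Some converter settings wrap the envelope in a "payload" object.
    payload = event.get("payload", event)
    op = payload["op"]
    # A delete has no "after" image, so the key comes from "before".
    image = payload["before"] if op == "d" else payload["after"]
    return {
        "action": OPS[op],
        "key": image[key_field],
        "row": None if op == "d" else image,
        "event_ts_ms": payload["ts_ms"],
        "source_table": payload["source"].get("table"),
    }


# A hand-written sample delete event, not captured from a real connector.
sample = json.dumps({
    "before": {"id": 7, "status": "paid"},
    "after": None,
    "source": {"connector": "postgresql", "table": "orders"},
    "op": "d",
    "ts_ms": 1700000000000,
})
change = parse_change(sample)
```

Here `change["action"]` is `"delete"` and `change["key"]` is `7`, so the loader knows which target row is gone without comparing full rows.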
What can go wrong with Debezium if the source is not ready
Many CDC problems start at the source. Log retention must be long enough to cover connector downtime, or changes the connector has not read yet can be purged before it catches up. Permissions must be correct, or Debezium can’t read the log. Stable table and column names matter too.
Frequent schema changes can create constant downstream breakage. A renamed column or type change may break staging tables, consumers, or MERGE logic. Good CDC work starts with source discipline, not connector settings.
Why Kafka is the middle layer that keeps CDC events moving
Kafka is the transport and buffer between capture and load. It stores change events durably, lets consumers read at their own pace, and protects the pipeline when the warehouse slows down. Kafka does not clean or model data by itself. Its job is to hold records, move them reliably, and make replay possible.
That separation is what makes Debezium with Kafka useful. The source keeps producing events, while loaders and warehouses consume them on their own schedule.
How topics and partitions help scale event delivery
A Kafka topic is a named stream of events. Partitions split that stream into several ordered lanes, which lets consumers process more data in parallel. More partitions often improve throughput, but ordering only holds inside a single partition.
That detail matters for CDC. If all events for the same primary key land in the same partition, updates for that row stay in order. If keys scatter badly, an older update may arrive after a newer one at the consumer level.
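The sketch below shows why keying by primary key keeps per-row order. It is a simulation, not Kafka code: `partition_for` uses CRC32 as a stand-in hash (Kafka's default partitioner uses a different hash function), and the order keys and three-partition layout are made up for the example.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumption for the example


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a record key to a partition with a deterministic hash.

    CRC32 is only a stand-in here, not Kafka's real partitioner.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Change events in the order the source committed them.
events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-1", "shipped"),
    ("order-2", "cancelled"),
]

# Each partition is an ordered list, like one lane of a topic.
partitions = defaultdict(list)
for key, change in events:
    partitions[partition_for(key)].append((key, change))


def history(key):
    """Read one key's changes back from the single partition that holds it."""
    lane = partitions[partition_for(key)]
    return [change for k, change in lane if k == key]
```

Because the same key always hashes to the same partition, `history("order-1")` returns `created`, `paid`, `shipped` in commit order, no matter how the other keys are spread across partitions.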
Why Kafka is useful even when the warehouse is the final target
Kafka helps when the warehouse is slow, paused, or briefly unavailable. A loader can fall behind without forcing the source database to replay old work. Later, the consumer can catch up by reading stored events from Kafka.
Kafka protects delivery and replay. It does not protect table correctness. Keys, ordering rules, and idempotent MERGE logic still matter.
Kafka also gives teams a clean inspection point. You can test consumers, branch the same CDC stream to other systems, or debug bad loads without querying production tables.
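The catch-up and replay behavior can be sketched with an in-memory log and a consumer that tracks its own offset. `TopicLog` and `Loader` are hypothetical classes for illustration only. They mimic the idea of offsets in a single partition and are not a Kafka client.

```python
class TopicLog:
    """In-memory stand-in for one partition: an append-only list."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def read_from(self, offset, max_records=None):
        end = len(self.records) if max_records is None else offset + max_records
        return list(enumerate(self.records[offset:end], start=offset))


class Loader:
    """Consumer that remembers the next offset it needs to read."""

    def __init__(self, log):
        self.log = log
        self.committed = 0
        self.loaded = []

    def poll(self, max_records=None):
        for offset, record in self.log.read_from(self.committed, max_records):
            self.loaded.append(record)
            self.committed = offset + 1


log = TopicLog()
for change in ["insert:1", "update:1", "insert:2"]:
    log.append(change)

loader = Loader(log)
loader.poll(max_records=1)  # the warehouse is slow, only one record lands
log.append("delete:2")      # capture keeps publishing in the meantime
log.append("update:1")
loader.poll()               # the loader catches up from its own offset

# Debugging or branching: read the whole stream again from offset 0.
replayed = [record for _, record in log.read_from(0)]
```

The producer side never waits for the loader, and the loader ends up with all five records in order. Reading from offset 0 gives a second consumer, or a debugging session, the same stream without touching the source database.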
How warehouse MERGE patterns turn events into clean tables
The warehouse is where CDC events become trusted tables. Most teams first land raw events in a staging table, then run a MERGE into the final target. MERGE matches incoming rows to existing rows by key and applies one action per row: insert, update, or delete. Snowflake, BigQuery, Databricks SQL, and other warehouses all support some version of this pattern.
So while Debezium and Kafka move the facts, the warehouse decides the final row state.
What MERGE does for upserts and deletes
For a new primary key or business key, MERGE inserts a row. If the key already exists, it updates the existing row. When the source event says the row was deleted, the target can delete it too, or mark it with a soft-delete flag.
Most teams also dedupe before the MERGE runs. If several update events arrive for the same key in one batch window, the loader often keeps only the latest valid event for that key.
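Here is a minimal Python sketch of that two-step logic, with a dict standing in for the target table. In a real warehouse this would be SQL. The event shape and the `lsn` field (a log position used to pick the latest event) are assumptions for the example.

```python
def latest_per_key(staged):
    """Keep only the event with the highest log position for each key."""
    latest = {}
    for event in staged:
        current = latest.get(event["key"])
        if current is None or event["lsn"] > current["lsn"]:
            latest[event["key"]] = event
    return list(latest.values())


def merge(target, events):
    """Apply one action per key, mirroring MERGE semantics."""
    for event in events:
        if event["op"] == "delete":
            target.pop(event["key"], None)      # matched and deleted: remove
        else:
            target[event["key"]] = event["row"]  # matched: update, else insert
    return target


# Existing target rows and one batch of staged events (made-up data).
target = {1: {"status": "new"}, 2: {"status": "new"}}
staged = [
    {"key": 1, "op": "update", "lsn": 10, "row": {"status": "paid"}},
    {"key": 2, "op": "delete", "lsn": 11, "row": None},
    {"key": 1, "op": "update", "lsn": 12, "row": {"status": "shipped"}},
    {"key": 3, "op": "insert", "lsn": 13, "row": {"status": "new"}},
]

batch = latest_per_key(staged)
merge(target, batch)
```

Key 1 is updated once with its latest value, key 2 is removed, and key 3 is inserted. Deduping first matters because a MERGE statement in most warehouses expects at most one source row per target key.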
Why idempotency matters in warehouse loading
CDC events can arrive more than once. A consumer may retry after failure, or a restart may reread older Kafka offsets. Therefore, the warehouse load must be idempotent, which means you can run it again and still get the same final table.
MERGE helps because it compares keys instead of blindly appending rows. Even then, idempotency depends on the rules around the MERGE. You need a reliable key, a way to identify the latest event, and a method for ignoring exact duplicates.
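A quick way to see the difference is to load the same batch twice, once by appending and once by key with a position check. This is a sketch under assumptions: `append_load`, `keyed_load`, and the `lsn` field are hypothetical names, and lists and dicts stand in for warehouse tables.

```python
def append_load(table, batch):
    """Blind append: every delivery adds rows, including retries."""
    table.extend(event["row"] for event in batch)


def keyed_load(table, batch):
    """Keyed load: skip any event at or before the stored log position."""
    for event in batch:
        current = table.get(event["key"])
        if current is not None and event["lsn"] <= current["lsn"]:
            continue  # exact duplicate or older event, nothing to do
        table[event["key"]] = {**event["row"], "lsn": event["lsn"]}


batch = [
    {"key": 1, "lsn": 10, "row": {"status": "paid"}},
    {"key": 2, "lsn": 11, "row": {"status": "new"}},
]

appended, keyed = [], {}
for _ in range(2):  # the second pass simulates a retry after a failure
    append_load(appended, batch)
    keyed_load(keyed, batch)
```

After the retry, `appended` holds four rows while `keyed` still holds two. The keyed version is safe to rerun because it has all three ingredients: a key to match on, a position to rank events, and a rule that ignores anything already applied.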
Common warehouse design choices that affect MERGE
Key design shapes the whole load. A bad key creates duplicates or merges unrelated rows. Timestamps matter too, because event time and load time are not the same thing. Late events should not overwrite newer state unless your rules allow it.
Many teams keep a raw landing table before the final MERGE. That makes replay, auditing, and debugging easier. If you only need the current version of each row, this often looks like an SCD Type 1 pattern with audit columns such as updated_at, deleted_flag, or source_event_ts.
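One possible shape for that current-state table is sketched below, using the audit columns named above. The `apply_scd1` function and the event layout are hypothetical. The sketch treats an event whose `source_event_ts` is not newer than the stored one as late or duplicate, which is an assumption. A real pipeline would usually break ties with a log position.

```python
from datetime import datetime, timezone


def apply_scd1(target, event, loaded_at):
    """Overwrite the current row (Type 1) while keeping audit columns.

    Returns True if the event changed the target, False if it was skipped.
    """
    current = target.get(event["key"])
    if current is not None and event["source_event_ts"] <= current["source_event_ts"]:
        return False  # late or duplicate event must not overwrite newer state

    if event["op"] == "delete":
        row = dict(current or {})  # soft delete keeps the last known columns
        row["deleted_flag"] = True
    else:
        row = dict(event["row"], deleted_flag=False)

    row["source_event_ts"] = event["source_event_ts"]  # when it happened
    row["updated_at"] = loaded_at                      # when it was loaded
    target[event["key"]] = row
    return True


load_time = datetime(2024, 1, 1, tzinfo=timezone.utc)  # made-up load time
target = {}
applied = [
    apply_scd1(target, {"key": 1, "op": "update", "source_event_ts": 100,
                        "row": {"status": "paid"}}, load_time),
    apply_scd1(target, {"key": 1, "op": "update", "source_event_ts": 90,
                        "row": {"status": "new"}}, load_time),
    apply_scd1(target, {"key": 1, "op": "delete", "source_event_ts": 120,
                        "row": None}, load_time),
]
```

The late event with timestamp 90 is skipped, so the row keeps `status = "paid"`. The delete then flips `deleted_flag` instead of removing the row, which keeps the last known values available for audits.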
The most common CDC pipeline mistakes and how to avoid them
Most CDC failures start small. A missed delete, a weak key, or a schema change may not break the job, but it can still corrupt the warehouse table. Green runs do not always mean correct data.
Missed deletes are common when loaders only handle inserts and updates. Out-of-order events can overwrite newer values with older ones. Schema drift can break staging jobs, while duplicate events inflate counts when MERGE logic is not safe to replay.
How to handle duplicates, late events, and schema changes
Use a stable key for matching. Keep an event timestamp or log position, then define a clear winner when several events exist for the same row. Store raw events before transformation if possible, because replay is the fastest way to recover after a bad load.
For schema changes, validate events before the MERGE step. If a column type changes or a required field disappears, route that data to review instead of forcing it into a broken target table. Good pipeline design matters more than any single tool in the stack.
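A validation gate in front of the MERGE can be as simple as the sketch below. The `EXPECTED` column contract, the `validate` and `route` functions, and the sample rows are all hypothetical. A production pipeline would more likely rely on a schema registry or the warehouse's own constraints.

```python
# Hypothetical contract for the staged rows: column name to Python type.
EXPECTED = {"id": int, "status": str, "amount": float}


def validate(row, expected=EXPECTED):
    """Return a list of problems; an empty list means the row is safe."""
    problems = []
    for column, expected_type in expected.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            actual = type(row[column]).__name__
            problems.append(
                f"{column}: expected {expected_type.__name__}, got {actual}"
            )
    return problems


def route(rows):
    """Split rows into those ready for MERGE and those needing review."""
    ready, review = [], []
    for row in rows:
        problems = validate(row)
        if problems:
            review.append({"row": row, "problems": problems})
        else:
            ready.append(row)
    return ready, review


rows = [
    {"id": 1, "status": "paid", "amount": 19.99},
    {"id": 2, "status": "paid", "amount": "19.99"},  # type changed upstream
    {"id": 3, "amount": 5.0},                        # required field missing
]
ready, review = route(rows)
```

Only the first row reaches the MERGE. The other two land in `review` with a reason attached, so a schema change shows up as a visible queue instead of a silent bad load.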
FAQ
Do I need Kafka for every CDC pipeline?
No, you don’t. Some teams send CDC data straight into a warehouse through managed services or simpler connectors. Kafka becomes most useful when you need buffering, replay, multiple consumers, or loose coupling between capture and load. For a tiny one-off pipeline, it may be more infrastructure than you need.
Can MERGE handle deletes from CDC events?
Yes, but only if your loader keeps delete events and your SQL handles them. Many broken pipelines capture inserts and updates but skip deletes. Then the warehouse keeps rows that no longer exist in the source. Always test delete behavior with replay, backfills, and point-in-time checks.
How do I stop duplicate rows in a CDC warehouse table?
Start with a stable key and idempotent load logic. Then dedupe staged events by key plus event time, log position, or transaction metadata. When retries resend the same event, the MERGE should land on the same row state instead of creating a second copy.
What’s a good first project for learning Debezium and Kafka?
Use one PostgreSQL or MySQL table with a real primary key. Capture its changes with Debezium, publish them to one Kafka topic, land them in a raw warehouse table, and write a MERGE into a current-state model. That project teaches the hard parts without too much setup noise.
Conclusion
A strong CDC pipeline is a chain of trust. Debezium captures committed changes, Kafka keeps those events moving safely, and warehouse MERGE logic turns them into clean tables. If one piece is weak, the final table drifts.
The best way to learn this is to build a small version yourself. Start with one source table, send changes through Kafka, land them raw, and write a MERGE that handles inserts, updates, deletes, and retries.

