A strong answer in a CDC pipeline design interview starts with the core job: capture row-level changes from a source system, move them safely and in order, and land them in a warehouse or lake with low delay and high trust. CDC, or change data capture, means tracking inserts, updates, and deletes instead of reloading whole tables.
In an interview, the goal is bigger than naming Kafka, Debezium, or AWS DMS. You need to show how you think about requirements, tradeoffs, failure handling, and scale.
That thinking starts before you draw a single box.
Key Points
- A good CDC design begins with clear requirements, not tool names.
- Log-based capture is usually the best default for busy databases.
- Kafka helps with buffering, replay, ordering by key, and fan-out.
- At-least-once delivery with idempotent writes is often the best interview answer.
- Reliable CDC pipelines need replay, monitoring, schema controls, and safe failure paths.
What the interviewer wants to hear before you draw the architecture
Strong candidates don’t jump into a change data capture system design with a pre-built stack. They slow down, ask sharp questions, and define success first. That signals judgment.
The five questions that shape the design
- How fresh must the data be? A target of seconds pushes you toward streaming, while a target of minutes may allow micro-batches.
- What are the source and target systems? MySQL to Redshift is different from PostgreSQL to S3 and Athena.
- Do updates and deletes matter? If deletes matter, you need tombstones or delete flags all the way through.
- How much data changes per second? High write volume affects connector choice, partitioning, and sink load strategy.
- What happens if data is late or duplicated? That answer drives idempotency, replay, and checkpoint design.
Turn vague requirements into clear success criteria
Turn soft phrases into simple targets. “Near real-time” can mean under 60 seconds end-to-end. “Reliable” can mean at-least-once delivery with replay from saved offsets. “Low cost” can mean one shared stream and batch merges every few minutes.
A clean interview answer sounds like this: data should arrive within one minute, preserve per-key ordering, tolerate duplicate events, support schema evolution, and recover without data loss after a connector restart. That’s clear, testable, and easy to defend.
Build the CDC pipeline step by step from source to target
Most interviewers want a clean flow: source capture, event transport, light processing, storage, and serving. Keep the path simple, then add safeguards.
How change data gets captured from the source database
CDC usually starts at the database log, such as MySQL binlog or PostgreSQL WAL. Log-based capture reads committed changes with less overhead than triggers, so it’s the usual default for scale. Triggers can work, but they add write-path cost and operational risk.
You also need an initial snapshot. That backfill loads current table state first, then log reading catches new changes after the snapshot point.
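The snapshot-then-stream handoff can be sketched in a few lines. This is a minimal illustration, not how any specific connector implements it: `ChangeEvent`, `bootstrap`, and the integer `position` field are hypothetical names standing in for a real log position such as a binlog offset or WAL LSN.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    position: int          # hypothetical, monotonically increasing log position
    op: str                # "insert", "update", or "delete"
    key: int
    row: Optional[dict]    # None for deletes


def bootstrap(snapshot_rows: dict, snapshot_position: int, log: list) -> dict:
    """Load the snapshot, then apply only changes logged after the snapshot point."""
    table = {key: dict(row) for key, row in snapshot_rows.items()}
    for event in sorted(log, key=lambda e: e.position):
        if event.position <= snapshot_position:
            continue  # already reflected in the snapshot
        if event.op == "delete":
            table.pop(event.key, None)
        else:
            table[event.key] = dict(event.row)
    return table


snapshot = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
log = [
    ChangeEvent(90, "update", 1, {"name": "Ada"}),      # before the snapshot point
    ChangeEvent(101, "update", 2, {"name": "Robert"}),
    ChangeEvent(102, "insert", 3, {"name": "Cy"}),
    ChangeEvent(103, "delete", 1, None),
]
state = bootstrap(snapshot, snapshot_position=100, log=log)
print(state)
```

The point to make in the interview is the boundary: changes at or before the snapshot position are skipped, and everything after it is applied in log order.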
Why a message bus such as Kafka often sits in the middle
Kafka often appears in CDC interviews because it solves several problems at once. It buffers bursts, stores events for replay, and lets multiple consumers read the same stream. One consumer can feed a warehouse, while another updates a cache or search index.
Partition by a stable key, often the primary key, when you need in-order updates for the same row. You won’t get total global order, but you usually don’t need it.
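Here is a small sketch of why key-based partitioning preserves per-row order. It uses a CRC32 hash purely for illustration; that is an assumption of this example and not the hash Kafka's own partitioner uses. `partition_for` and `changes_for` are hypothetical helper names.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition deterministically (illustrative hash only)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-1", "shipped"),
    ("order-2", "cancelled"),
]

# Each partition is an append-only list, so events for one key keep their order.
partitions = defaultdict(list)
for key, change in events:
    partitions[partition_for(key)].append((key, change))


def changes_for(key: str) -> list:
    """Read back the changes for one key from its partition."""
    return [change for k, change in partitions[partition_for(key)] if k == key]


print(changes_for("order-1"))
```

Because the same key always lands in the same partition, a single consumer sees that row's changes in the order they were produced, even though there is no ordering across partitions.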
What the processing layer should do before loading data
Keep processing light unless the business rules are heavy. Validate required fields, check schema versions, drop or quarantine bad records, and deduplicate when needed. If enrichment is small, a stream processor or Lambda-style function may be enough.
If sink writes are expensive, micro-batching can help. Group events into short windows, then write fewer, larger merge operations.
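One way to picture micro-batching is compaction inside a window: keep only the newest change per key, then send one write per key. The sketch below assumes the sink only needs the latest row state, not full change history, and that each event carries a hypothetical `position` field for ordering.

```python
def compact_batch(events: list) -> list:
    """Keep only the latest event per key within one micro-batch window."""
    latest = {}
    for event in events:
        current = latest.get(event["key"])
        if current is None or event["position"] > current["position"]:
            latest[event["key"]] = event
    return sorted(latest.values(), key=lambda e: e["position"])


window = [
    {"key": "a", "position": 1, "value": 10},
    {"key": "b", "position": 2, "value": 5},
    {"key": "a", "position": 3, "value": 11},
    {"key": "a", "position": 4, "value": 12},
    {"key": "b", "position": 5, "value": 6},
]
batch = compact_batch(window)
print(len(window), "events ->", len(batch), "writes")
```

If the target keeps history, such as a slowly changing dimension, skip the compaction and batch the raw events instead.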
How the target system should receive clean, usable data
The sink depends on the use case. Warehouses like Snowflake, BigQuery, and Redshift often load into staging tables, then run merge or upsert logic. Lakehouse patterns may land raw CDC events in S3, then build clean tables with Iceberg, Delta Lake, or Hudi.
Deletes need an explicit plan. Some teams apply hard deletes, while others mark rows as deleted and preserve history for audits or slowly changing dimensions.
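The two delete strategies differ by only a few lines at the sink. This is a minimal sketch using in-memory dictionaries as stand-in tables; `apply_delete`, `is_deleted`, and `deleted_at` are illustrative names, not a standard.

```python
def apply_delete(table: dict, key, deleted_at: str, soft: bool) -> None:
    """Apply a delete event as either a soft delete (flag) or a hard delete."""
    if key not in table:
        return  # deleting a missing row is a no-op, which keeps replays safe
    if soft:
        table[key] = {**table[key], "is_deleted": True, "deleted_at": deleted_at}
    else:
        del table[key]


audit_table = {7: {"email": "a@example.com"}}
serving_table = {7: {"email": "a@example.com"}}

apply_delete(audit_table, 7, "2024-01-01T00:00:00Z", soft=True)
apply_delete(serving_table, 7, "2024-01-01T00:00:00Z", soft=False)

print(audit_table)
print(serving_table)
```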
Explain the tradeoffs that make your design look senior
Senior answers sound calm because they pick sensible defaults and explain why. You don’t need the fanciest semantics. You need the safest fit.
How to handle duplicates without breaking the pipeline
Duplicates happen in real systems. Retries, consumer restarts, and sink timeouts can all replay the same change. Design for that from the start.
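Make writes idempotent by keying each change on something stable: a unique change identifier when the source provides one, the source log position, or, as a weaker fallback, the primary key plus commit timestamp, which can collide when a row changes more than once in the same transaction. Store checkpoints, and deduplicate by that identifier or within a bounded time window.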
In interviews, “at-least-once delivery plus idempotent writes” is usually the best default.
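A minimal sketch of an idempotent sink, assuming every event carries a per-key increasing `position` taken from the source log. `IdempotentSink` is a hypothetical class for illustration; in a real warehouse the same check would usually live in the merge condition.

```python
class IdempotentSink:
    """Applies each change at most once by tracking the last position per key."""

    def __init__(self):
        self.rows = {}           # key -> current row
        self.last_position = {}  # key -> last applied log position

    def apply(self, event: dict) -> bool:
        key = event["key"]
        if event["position"] <= self.last_position.get(key, -1):
            return False  # duplicate or stale replay, safe to ignore
        if event["op"] == "delete":
            self.rows.pop(key, None)
        else:
            self.rows[key] = event["row"]
        self.last_position[key] = event["position"]
        return True


events = [
    {"key": 1, "position": 10, "op": "upsert", "row": {"status": "new"}},
    {"key": 1, "position": 11, "op": "upsert", "row": {"status": "paid"}},
    {"key": 2, "position": 12, "op": "upsert", "row": {"status": "new"}},
]

sink = IdempotentSink()
first_pass = [sink.apply(e) for e in events]
replay = [sink.apply(e) for e in events]  # simulate a retry of the whole batch
print(first_pass, replay, sink.rows)
```

Replaying the batch changes nothing, which is exactly what lets at-least-once delivery stay safe.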
What to do about schema changes and evolving tables
Schema drift breaks downstream jobs when teams ignore it. New columns are usually easy if consumers can ignore unknown fields. Dropped columns and type changes are harder because they can break merge logic and dashboards.
A schema registry or contract rules help. In the interview, say you would version schemas, validate producers, and alert on breaking changes before they reach the warehouse.
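A compatibility check can be as simple as comparing two schema versions before a change is allowed through. The sketch below is a simplified stand-in for what a schema registry enforces; `breaking_changes` and the column-to-type dictionaries are assumptions made for this example.

```python
def breaking_changes(old: dict, new: dict) -> list:
    """Report dropped columns and type changes; added columns are allowed."""
    problems = []
    for column, col_type in old.items():
        if column not in new:
            problems.append(f"dropped column: {column}")
        elif new[column] != col_type:
            problems.append(f"type change: {column} {col_type} -> {new[column]}")
    return problems


v1 = {"id": "bigint", "email": "varchar", "age": "int"}
additive = breaking_changes(v1, {**v1, "plan": "varchar"})
risky = breaking_changes(v1, {"id": "bigint", "age": "varchar", "plan": "varchar"})

print(additive)
print(risky)
```

An empty list means the change is additive and can flow through; anything else should alert and block before it reaches the warehouse.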
When to choose batch, micro-batch, or streaming CDC
Use this quick comparison when you discuss latency, cost, and complexity.
| Style | Typical latency | Cost and ops | Good fit |
| --- | --- | --- | --- |
| Batch | Minutes to hours | Lowest | Nightly loads and low-priority reporting |
| Micro-batch | 30 seconds to 5 minutes | Medium | Standard warehouse sync and dashboards |
| Streaming | Seconds | Highest | Alerts, user-facing features, operational sync |
Most interview answers land well with micro-batch or streaming, depending on freshness. Pick batch only when latency is loose and simplicity matters more than speed.
Show how you would make the pipeline reliable and safe in production
A good diagram without operations is half an answer. Interviewers expect retries, replay, monitoring, and clear failure paths.
The failure modes interviewers expect you to spot
Source outages stop new changes, so the connector should resume from the last offset when the source returns. Connector failures need restart logic and alerting. Lag spikes call for backpressure handling, more consumer capacity, or cheaper sink writes, such as larger batched merges.
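Resuming from the last offset comes down to committing a checkpoint only after an event has been handled. This sketch simulates an outage with a hypothetical `crash_before` parameter; `run_connector` and the dictionary checkpoint are illustrative, not a real connector API.

```python
def run_connector(log: list, checkpoint: dict, delivered: list, crash_before=None) -> None:
    """Process from the saved offset and commit only after each event is handled."""
    offset = checkpoint.get("offset", 0)
    while offset < len(log):
        if offset == crash_before:
            raise ConnectionError("simulated source outage")
        delivered.append(log[offset])
        offset += 1
        checkpoint["offset"] = offset  # commit after successful handling


log = ["c1", "c2", "c3", "c4"]
checkpoint, delivered = {}, []

try:
    run_connector(log, checkpoint, delivered, crash_before=2)
except ConnectionError:
    pass

offset_after_outage = checkpoint["offset"]
run_connector(log, checkpoint, delivered)  # restart resumes from the checkpoint
print(offset_after_outage, delivered)
```

Committing after handling means a crash between the two steps can redeliver an event, which is why this pairs with idempotent sink writes.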
Poison messages should go to a dead-letter queue after limited retries. Warehouse load errors should not block the whole stream forever, so stage failed batches and replay them. Bad schema updates should fail fast before they fan out.
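Bounded retries with a dead-letter path might look like the sketch below. It leaves out backoff delays to stay short, and `process_with_dlq` and `load_row` are hypothetical names; a real pipeline would publish dead letters to a separate topic or queue instead of a list.

```python
def process_with_dlq(events: list, handler, max_attempts: int = 3):
    """Retry each event a limited number of times, then dead-letter it."""
    processed, dead_letters = [], []
    for event in events:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(event))
                break
            except Exception as exc:
                if attempt == max_attempts:
                    dead_letters.append(
                        {"event": event, "error": str(exc), "attempts": attempt}
                    )
    return processed, dead_letters


def load_row(event: dict) -> int:
    """Stand-in sink write that rejects records without an id."""
    if "id" not in event:
        raise ValueError("missing required field: id")
    return event["id"]


processed, dead_letters = process_with_dlq(
    [{"id": 1}, {"name": "no id"}, {"id": 3}], load_row
)
print(processed, dead_letters)
```

The bad record is set aside with its error and attempt count, and the events behind it still get processed.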
The metrics that prove the pipeline is healthy
Track a small set of metrics that explain both speed and correctness:
- Consumer lag
- Events processed per second
- End-to-end freshness
- Error rate and dead-letter volume
- Duplicate rate after deduplication
- Sink load success and retry counts
Those metrics help on-call teams answer the only question that matters at 2 a.m., which is whether the data is late, wrong, or both.
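Two of those metrics are simple arithmetic, which is worth showing on the whiteboard. In this sketch the 60-second freshness limit comes from the "near real-time" target discussed earlier, while the lag threshold and all function names are assumptions chosen for illustration.

```python
def consumer_lag(end_offsets: dict, committed_offsets: dict) -> int:
    """Total events waiting across partitions: log end minus committed offset."""
    return sum(
        end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    )


def freshness_seconds(now_ts: float, newest_loaded_commit_ts: float) -> float:
    """Age of the newest source commit that has reached the sink."""
    return now_ts - newest_loaded_commit_ts


def is_healthy(lag: int, freshness: float,
               max_lag: int = 10_000, max_freshness: float = 60.0) -> bool:
    return lag <= max_lag and freshness <= max_freshness


lag = consumer_lag({0: 1500, 1: 2000}, {0: 1400, 1: 1950})
freshness = freshness_seconds(1_700_000_045.0, 1_700_000_000.0)
healthy = is_healthy(lag, freshness)
print(lag, freshness, healthy)
```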
Walk through a sample interview answer from requirements to final design
The best interview answer sounds like spoken reasoning, not a memorized script. Keep it short and ordered.
A simple whiteboard story you can tell in under five minutes
Start with requirements. Say you want to pin down freshness, source type, change volume, delete handling, and recovery expectations. Then propose a simple design: log-based CDC from the source database, events into Kafka, a small processing layer for validation and deduplication, and staged loads into a warehouse with merge logic.
After that, cover edge cases. Explain per-key ordering through partitioning, at-least-once delivery with idempotent sink writes, schema version checks, replay from offsets, and dead-letter handling for bad records. End with monitoring, lag alerts, and data quality checks between source counts and sink counts.
That answer feels structured because it moves in a clear order: requirements, design, tradeoffs, and operations.
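The source-versus-sink count check mentioned above can be sketched as a small reconciliation function. `reconcile_counts` and the `tolerance` parameter are hypothetical; in practice the counts would come from queries against both systems over the same time window.

```python
def reconcile_counts(source_counts: dict, sink_counts: dict, tolerance: int = 0) -> dict:
    """Return tables whose row counts differ by more than the tolerance."""
    mismatches = {}
    for table in sorted(set(source_counts) | set(sink_counts)):
        src = source_counts.get(table, 0)
        dst = sink_counts.get(table, 0)
        if abs(src - dst) > tolerance:
            mismatches[table] = (src, dst)
    return mismatches


mismatches = reconcile_counts(
    {"orders": 1000, "customers": 250},
    {"orders": 998, "customers": 250},
)
print(mismatches)
```

A small tolerance can absorb in-flight changes when the two counts are not taken at exactly the same moment.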
What to say if the interviewer asks for AWS-specific choices
Map the same design to AWS without changing the logic. Use AWS DMS for source capture, then send events to Amazon MSK or Kinesis. For processing, choose Lambda for lightweight validation or Glue streaming jobs when the logic is heavier. Land raw events in S3, then expose cleaned data through Athena or load curated tables into Redshift.
If the team cares about serverless ETL on AWS, say you’d prefer serverless services such as Kinesis and Lambda first, then justify any move to MSK or custom consumers only when scale or control demands it.
Conclusion
A strong CDC pipeline answer is clear, reliable, and aware of tradeoffs. Interviewers want to hear how you think about ordering, duplicates, schema drift, replay, and monitoring, not a long list of tools.
If you can explain the flow in plain language, choose sensible defaults, and name the main failure paths, you’ll sound prepared and senior.
FAQ
What is CDC in data engineering?
CDC means change data capture. It tracks inserts, updates, and deletes in a source system, then sends those changes to another system. Teams use it to avoid full reloads, cut latency, and keep warehouses, lakes, or downstream apps closer to the source of truth.
Why does Kafka show up so often in CDC interviews?
Kafka is common because it decouples producers from consumers and keeps events for replay. It also handles bursty traffic well and supports multiple downstream readers. In interviews, it gives you a clear story for buffering, ordering by key, recovery, and fan-out.
Do I need exactly-once delivery for a CDC pipeline interview?
Usually, no. Exactly-once is hard and expensive across real systems. A better default answer is at-least-once delivery plus idempotent writes in the sink. That choice is easier to build, easier to explain, and good enough for many warehouse and analytics pipelines.
How do I handle deletes in a CDC design?
Carry delete events all the way through the pipeline. Then decide whether the target should hard delete the row, soft delete it with a flag, or keep history in a separate table. The right choice depends on audits, analytics needs, and downstream consumers.