SQL MERGE for Data Engineers: Upserts, CDC, and Idempotent Pipelines

By: Chris Garzon | June 3, 2026 | 10 mins read

SQL MERGE matches incoming rows to existing rows and then updates, inserts, or deletes them in one statement. Data engineers use it to write upsert logic, finish CDC loads, and make repeat runs safe. In data engineering, SQL MERGE helps you keep warehouse tables current without chaining together separate update and insert jobs.

It also reduces failure points. If a batch retries, a well-built merge can land the same final state instead of creating duplicates.

Snowflake, BigQuery, Databricks, SQL Server, Redshift, and Postgres handle MERGE a little differently. The patterns below focus on what matters across platforms.

Key Points

SQL MERGE compares source rows to target rows by a match key.
It can update existing rows, insert new rows, and sometimes delete rows.
MERGE is often the last write step in a CDC pipeline.
Idempotent merge logic makes retries safer and cleaner.
Good keys and deduped source data matter more than fancy syntax.

Quick summary: MERGE works best when staged data is clean, keys are stable, and your rerun logic produces the same target table every time.

The Best Time to Start is NOW

What SQL MERGE actually does in a data warehouse

A merge statement joins a source dataset to a target table and decides what to do for each row. If the row already exists, you usually update it. If it does not exist, you insert it. In some cases, you also delete or soft-delete a row.

That sounds simple, but it solves a real warehouse problem. Without MERGE, you often need separate statements for updates, inserts, and cleanup. Each extra step adds more surface area for bugs.

How MERGE decides whether to insert, update, or delete

The logic starts with a match condition, usually a business key such as customer_id, order_id, or a composite key. Rows that match are “matched.” Rows that do not are “not matched.”

If a source row with customer_id = 42 already exists in the target, the merge treats it as an update candidate. If customer_id = 99 is new, the merge inserts it. If the source says a row was deleted, the merge may remove it or mark it inactive.

This is the basic pattern:

Source row state	Match in target	Common action
Current row arrives	Yes	Update selected columns
Current row arrives	No	Insert new row
Delete event arrives	Yes	Delete or soft-delete
Duplicate source row arrives	Maybe	Fail, skip, or pick one before merge

The key takeaway is simple: MERGE is driven by row matching, not by file order.

Why data engineers prefer MERGE over manual upsert logic

MERGE cuts down on moving parts. One statement is easier to read than an update followed by an insert and then a delete pass.

It also helps maintenance. When column logic changes, you update one place instead of three. That matters in shared pipelines where many people touch the same job.

Still, MERGE is not magic. Bad keys, duplicate source rows, or null match values can break the result fast. Clean source data comes first.

How to use MERGE for upserts without breaking your table

For warehouse loading, the safest pattern is to stage the source data first. That staging layer can be a temp table, a landing table, or a staging query. Then you merge from that controlled input into the final table.

This keeps your logic clear. It also gives you one place to dedupe, filter bad keys, and compare old versus new values before you write.

Set the right keys before you run the statement

Your merge key is the foundation. A natural key comes from the business, such as an email or order number. A surrogate key is an internal ID, often numeric. In most upsert cases, you match on the stable business key, not the warehouse surrogate key.

Problems start when the source key is missing or duplicated. Some engines reject the merge. Others update one matching row in a way you did not expect. Neither outcome is fun at 2 a.m.

Before every merge, validate these basics:

The source key is not null.
The source has one row per merge key.
The target does not contain duplicate business keys.

When to update existing rows and when to insert new ones

The common rule is easy. If the source key already exists, update the current record. If the key is new, insert it.

However, avoid full-row updates when only one field changed. Updating unchanged rows can waste compute, trigger extra clustering work, and make audit columns noisy. Many teams compare a hash, timestamp, or selected columns first, then update only real changes.

That small step keeps your upsert SQL lean and your warehouse cleaner.

Using MERGE with CDC streams and incremental loads

Change data capture, or CDC, records inserts, updates, and deletes from a source system. The merge usually happens after those events land in a raw table or change feed. In other words, MERGE is often the final write step, not the whole pipeline.

A healthy CDC flow looks like this: capture events, land them, clean and dedupe them, then merge them into the target table. Each stage does one job well.

How CDC events map to MERGE actions

This mapping helps keep the workflow predictable:

CDC event	Typical merge action	Common note
Insert	Insert if new, update if resent	Resent events happen
Update	Update matched row	Pick the latest version first
Delete	Delete row or set is_deleted = true	Soft deletes are common

A soft delete keeps the row but marks it inactive. That works well when reports or audits still need history.

Handling late arriving changes and out-of-order records

CDC data rarely arrives in perfect order. Event time and load time often differ, especially across queues, connectors, or retries.

Because of that, you usually need one more step before the merge. Rank records by source sequence, change version, or updated_at, then keep the latest row per key. If you skip that step, an older update can overwrite a newer one.

For incremental loads, this is where many bugs hide. The merge statement looks fine, but the staged input is wrong.

Making your pipeline idempotent so reruns do not create duplicates

An idempotent data pipeline can run more than once and still leave the target table in the same correct state. That matters when jobs fail halfway, schedulers retry, or the same source file shows up twice.

MERGE is a strong fit for idempotency because it checks whether the row already exists before writing. Still, the statement alone does not make the pipeline safe. The input data and batch rules matter just as much.

Common mistakes that break idempotency

Duplicate source rows are the first trap. If the same key appears twice in one batch, your merge may fail or apply the wrong version.

Unstable ordering is another problem. If “latest record wins” depends on load order, a rerun can change the result. Missing match keys cause even more damage because the merge can insert duplicates instead of updating the same row. Partial batches are risky too, because you may merge half the truth and then rerun against a different slice.

Checks that make reruns safer

Use a few boring checks, because boring pipelines are reliable pipelines:

Dedupe the staged source by merge key before writing.
Keep only the latest record per key when events can arrive out of order.
Use a watermark or batch ID so you know what data belongs in each run.
Compare affected row counts to expected ranges after the merge.

Key takeaway: an idempotent pipeline is less about clever SQL and more about stable keys, clean staging data, and repeatable batch boundaries.

Best practices and tradeoffs before you put MERGE in production

MERGE can be clean and fast, but performance depends on your engine and table layout. Large merges may scan more data than you expect if the target is not partitioned or clustered well. Some platforms also hold stronger locks than others.

That is why testing on your own warehouse matters. Snowflake, BigQuery, Databricks, SQL Server, Redshift, and Postgres all support merge-like patterns, but the cost profile and edge cases differ.

What to test before you trust the merge job

Start with production-like samples, not toy data. Test duplicate keys, null keys, late updates, delete events, and reruns of the same batch.

Also check row counts before and after each run. If the same batch lands twice, the second run should not grow the table unless the source truly changed.

When a different pattern may be better than MERGE

Sometimes MERGE is not the best tool. Append-only tables are simpler when you want a full change log. Insert-overwrite can be faster when you rebuild a whole partition. Partition swaps work well when you prepare clean data separately and then replace old data in one move.

Pick MERGE when you need upserts, CDC handling, or repeatable batch loads. Pick a simpler pattern when the table design does not need row-level matching.

FAQ

What is SQL MERGE in data engineering?

SQL MERGE is a statement that compares source rows to a target table and then updates, inserts, or deletes rows based on a match rule. Data engineers use it to keep warehouse tables current without writing separate steps for each action.

Is MERGE the same as an upsert?

Not exactly. An upsert usually means update if matched, insert if not matched. MERGE can do that, but it can also handle delete logic and more detailed match conditions. So every upsert can look like a merge pattern, but MERGE can do more.

Should I use MERGE for CDC pipelines?

Yes, often as the final write step. CDC captures changes from the source system, but the merge applies those changes to the warehouse table. You still need landing, deduping, ordering, and event filtering before the merge runs.

How do I make a MERGE pipeline idempotent?

Start with stable keys and deduped staged data. Then keep one latest record per key, use clear batch boundaries, and test reruns with the same input. If the second run changes the final result, the pipeline is not idempotent yet.

What should I learn after SQL MERGE?

Learn data modeling and CDC design next. MERGE gets much easier when you understand business keys, slowly changing dimensions, and incremental loading patterns. A good next read is a guide on CDC architecture or dimensional modeling for analytics warehouses.

Conclusion

SQL MERGE is a practical tool for warehouse upserts, CDC targets, and rerunnable batch jobs. It works best when your match keys are trustworthy, your staged data is deduped, and your rerun rules are stable.

Review one existing pipeline this week and ask a simple question: if it reruns tonight, will the final table stay correct? If the answer is unclear, start by testing the merge on a staging table. If you want guided practice, Data Engineer Academy’s SQL course can help you work through these patterns on real projects.

Next Article: Partitioning and Clustering in Warehouses

Chris Garzon

Christopher Garzon has worked as a data engineer for Amazon, Lyft, and an asset management start up where he was responsible for building the entire Data Infrastructure from scratch. He is the author “Ace the Data Engineer Interview” and has helped 100’s of students break into the data engineer industry. He is also an angel investor, an advisor to multiple to multiple start ups, and the founder and CEO of Data Engineer Academy.