Airflow in Production: Backfills, Retries, SLAs, and Failed DAG Recovery

By: Chris Garzon | May 28, 2026 | 8 mins read

Production Airflow problems usually come from three places: bad retry settings, risky backfills, and weak recovery plans after failures. The fix is a short set of Airflow production best practices that keep reruns safe, reduce alert noise, and prevent duplicate writes.

If your DAGs fail at 2 a.m., the hard part is not clicking “clear” in the UI. The hard part is knowing when to retry, when to rerun, and when to reprocess history without breaking downstream jobs.

Key Points

Retries should cover short outages, not bad code.
Backfills should target clear date windows and safe writes.
Every important DAG needs an owner and a recovery runbook.
SLAs should reflect real freshness needs, not wishful timing.
Recovery ends only after the data checks pass.

Quick summary: Stable Airflow depends on controlled retries, narrow backfills, clear SLAs, and recovery steps that verify data after the DAG turns green again.

Key takeaway: Failure is normal in production. Unplanned recovery is what causes duplicate rows, stale dashboards, and on-call chaos.

Quick promise: Apply these habits and you will spend less time babysitting DAGs and more time fixing the few failures that matter.


Start with a production mindset before you touch retries or backfills

Airflow in production is not about getting tasks to run once. It is about predictable behavior every day. A DAG that “usually works” is risky when it feeds finance tables, executive dashboards, or customer exports.

Some failures are obvious, like red tasks and pager alerts. Others hide in plain sight, like duplicate records, missed partitions, and late data that lands after reporting closes. Because of that, each important DAG should have an owner, a runbook, and a rule for what happens when a run fails.

What makes a DAG safe to run in production?

A production-ready DAG has idempotent tasks, explicit dependencies, sensible retries, and outputs you can inspect. If you rerun a task, it should overwrite the right partition or skip work that already finished. It should not insert the same rows twice or break a downstream dbt model.
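
To make the "overwrite the right partition" idea concrete, here is a minimal sketch of an idempotent write. It uses an in-memory SQLite database as a stand-in for a warehouse, and the table name `daily_orders`, its columns, and the helper functions are hypothetical. The pattern is delete-then-insert for one logical date inside a single transaction, so a rerun replaces the slice instead of appending to it.

```python
import sqlite3


def overwrite_partition(conn, ds, rows):
    """Replace all rows for one logical date in a single transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM daily_orders WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO daily_orders (ds, order_id, amount) VALUES (?, ?, ?)",
            [(ds, order_id, amount) for order_id, amount in rows],
        )


def row_count(conn, ds):
    """Count rows currently stored for one logical date."""
    return conn.execute(
        "SELECT COUNT(*) FROM daily_orders WHERE ds = ?", (ds,)
    ).fetchone()[0]


# Hypothetical table standing in for a warehouse target.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_orders (ds TEXT, order_id INTEGER, amount REAL)")

rows = [(1, 10.0), (2, 25.5)]
overwrite_partition(conn, "2026-05-01", rows)
overwrite_partition(conn, "2026-05-01", rows)  # simulated rerun of the same task

print(row_count(conn, "2026-05-01"))  # still 2 rows after the rerun
```

A plain `INSERT` without the `DELETE` would leave four rows after the second call, which is exactly the duplicate-write problem a rerun should never cause.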

Why failure handling matters more than perfect schedules

No team gets zero failures. APIs time out, warehouses pause, and files arrive late. Good failure handling protects trust, because people can recover the right window quickly and confirm the data is still correct.

Use retries the right way so temporary issues do not break the pipeline

Airflow retries help with short-lived problems, not broken logic. A retry makes sense when an API call times out, a warehouse connector drops, or a database lock clears after a brief wait. In those cases, one more attempt can save a run and cut noise.

The settings matter. Retry count, retry delay, and exponential backoff should match the task. Small metadata checks can retry quickly. Heavy loads into Snowflake, BigQuery, or Redshift need more caution because each retry costs time and compute.
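
As a rough illustration, the sketch below writes two retry profiles as plain dictionaries. The key names follow the operator arguments documented for Airflow 2.x (`retries`, `retry_delay`, `retry_exponential_backoff`, `max_retry_delay`), but the values are illustrative assumptions, not recommendations, and argument behavior can differ between Airflow versions. The helper `approximate_delays` is hypothetical and deliberately simplified: it doubles the delay per attempt and ignores the jitter that a real scheduler may add.

```python
from datetime import timedelta

# Small metadata check: cheap to repeat, so retry quickly and a few times.
LIGHT_TASK_RETRIES = {
    "retries": 3,
    "retry_delay": timedelta(seconds=30),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(minutes=5),
}

# Heavy warehouse load: each attempt costs compute, so retry once after a longer wait.
HEAVY_LOAD_RETRIES = {
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
    "retry_exponential_backoff": False,
}


def approximate_delays(settings):
    """Rough wait before each retry (simplified: doubling, capped, no jitter)."""
    base = settings["retry_delay"]
    cap = settings.get("max_retry_delay")
    delays = []
    for attempt in range(settings["retries"]):
        if settings.get("retry_exponential_backoff"):
            delay = base * (2 ** attempt)
        else:
            delay = base
        if cap is not None:
            delay = min(delay, cap)
        delays.append(delay)
    return delays


for name, profile in [("light", LIGHT_TASK_RETRIES), ("heavy", HEAVY_LOAD_RETRIES)]:
    print(name, [str(d) for d in approximate_delays(profile)])
```

Dictionaries like these could be passed as `default_args` or per task, which keeps the cheap and expensive profiles visibly separate in code review.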

Which failures should retry, and which should fail fast?

Retry transient errors such as network timeouts, rate limits, lock conflicts, and brief auth issues. Fail fast for bad SQL, schema drift, missing required files, and code bugs. If the input is wrong or the query is wrong, more retries only delay the real fix.
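
One way to express that split in code is to retry only a known set of transient exception types and let everything else surface immediately. The sketch below is framework-free: `TransientError`, `PermanentError`, and `run_with_retries` are hypothetical names, and the two failing functions are simulations. Inside Airflow itself, raising `AirflowFailException` from `airflow.exceptions` is a common way to fail a task without using its remaining retries; check the documentation for your version.

```python
class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (rate limits, lock conflicts)."""


class PermanentError(Exception):
    """Hypothetical marker for errors that need a human fix (bad SQL, schema drift)."""


RETRYABLE = (TimeoutError, ConnectionError, TransientError)


def run_with_retries(func, max_retries=2):
    """Call func; retry only retryable errors and re-raise anything else at once."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return func(), attempts
        except RETRYABLE:
            if attempts > max_retries:
                raise  # retries exhausted, surface the transient error


calls = {"flaky": 0, "broken": 0}


def flaky_api_call():
    """Simulated API call that times out once, then succeeds."""
    calls["flaky"] += 1
    if calls["flaky"] == 1:
        raise TimeoutError("simulated network timeout")
    return "ok"


def broken_query():
    """Simulated task with bad SQL: retrying cannot help."""
    calls["broken"] += 1
    raise PermanentError("simulated bad SQL")


result, attempts = run_with_retries(flaky_api_call)

failed_fast = False
try:
    run_with_retries(broken_query)
except PermanentError:
    failed_fast = True

print(result, attempts, calls["broken"], failed_fast)
```

The flaky call succeeds on its second attempt, while the broken query is tried exactly once before the error reaches whoever needs to fix it.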

How to set retry timing without creating alert fatigue

Retries should buy time for a system to recover, not hide a broken DAG for hours. Short delays fit small tasks. Backoff helps when contention is likely. Most importantly, alerts should reach the on-call team before the business deadline passes. Otherwise, the DAG looks “busy” while the data is already late.
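
A quick budget calculation helps here. The sketch below assumes the worst case, where every attempt runs its full duration and fails, with a fixed delay between attempts. It then compares the moment the final failure would surface against the business deadline. The function name, times, and durations are hypothetical.

```python
from datetime import datetime, timedelta


def latest_final_failure(start, task_duration, retries, retry_delay):
    """Worst case: every attempt runs full length and fails, with a fixed delay between attempts."""
    attempts = retries + 1
    return start + task_duration * attempts + retry_delay * retries


start = datetime(2026, 5, 28, 2, 0)      # run starts at 02:00
deadline = datetime(2026, 5, 28, 6, 0)   # data must be ready by 06:00

worst_case = latest_final_failure(
    start,
    task_duration=timedelta(minutes=40),
    retries=3,
    retry_delay=timedelta(minutes=15),
)
alert_in_time = worst_case <= deadline

print(worst_case, alert_in_time)
```

With three retries the final failure lands at 05:25, before the deadline. With five retries under the same assumptions it would land at 07:15, so the team would hear about the problem only after the data is already late.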

Treat backfills as controlled reprocessing, not a quick rerun

An Airflow backfill is a way to fill missed intervals or reprocess old data after a fix. It is useful after upstream outages, late-arriving files, or a logic change that affected historical partitions. It is also one of the easiest ways to overload a warehouse or corrupt downstream data if you rush it.

This quick comparison keeps the choice clear:

Situation	Best move	Reason
One task failed in one run	Rerun	The time window already exists
Several historical intervals are missing	Backfill	You need scheduled history rebuilt
Logic changed for old partitions	Backfill	Past output is wrong
Bad code still exists	Fix first	Reprocessing now repeats the bug

The main takeaway is simple: match the recovery action to the time window that is wrong.

When should you backfill instead of rerunning a failed task?

Use a rerun for a single failed attempt when the rest of the run is still valid. Use a backfill when entire historical intervals are missing, partial, or wrong after a code fix. Do not backfill “everything” unless the business impact truly spans all history.

How to backfill without creating duplicate data

Backfills stay safe when writes are idempotent. Partition overwrite patterns work well, because each rerun replaces one logical slice instead of appending again. Run small date batches first, check row counts and freshness, then widen the range. After that, confirm that downstream tables, dashboards, and exports still line up with the corrected window.
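
The "small batches first" step can be sketched as a helper that splits a date window into consecutive chunks. The function `date_batches` and the dates are hypothetical. Each resulting pair could feed whatever backfill mechanism your Airflow version provides, with a validation pause between batches.

```python
from datetime import date, timedelta


def date_batches(start, end, batch_days):
    """Split the inclusive range [start, end] into consecutive batches of at most batch_days days."""
    if batch_days < 1:
        raise ValueError("batch_days must be at least 1")
    batches = []
    current = start
    while current <= end:
        batch_end = min(current + timedelta(days=batch_days - 1), end)
        batches.append((current, batch_end))
        current = batch_end + timedelta(days=1)
    return batches


batches = date_batches(date(2026, 5, 1), date(2026, 5, 10), batch_days=4)
for batch_start, batch_end in batches:
    print(batch_start, "to", batch_end)
```

Ten days with a batch size of four yields three windows: May 1 to 4, May 5 to 8, and May 9 to 10. Running the first window alone and checking it before launching the rest limits the damage if the fix turns out to be incomplete.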

Build a clear plan for failed DAG recovery before the incident happens

DAG failure recovery should be a process, not a guess. Start with the logs, identify the real error, and then decide whether the run needs a retry, a rerun, or a backfill. If partial data already landed, clean that state before you replay the work.

A simple decision flow for recovery: retry, rerun, or backfill?

First, check whether the failure was temporary. If yes, a retry is often enough. Next, check whether the failed run wrote partial data for that window. If yes, clean or overwrite that partition before rerunning. Then ask whether code or input data changed. If the fix applies to past intervals, run a backfill for only those dates.
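
The decision flow above can be written down as a small function, which is also a handy shape for a runbook checklist. This is only a sketch: the function name, the three yes/no inputs, and the wording of each step are assumptions. Cleanup is listed first because partial data has to be cleared before any replay.

```python
def recovery_plan(transient, wrote_partial_data, fix_affects_past_intervals):
    """Return ordered recovery steps that follow the decision flow in the text."""
    steps = []
    if wrote_partial_data:
        steps.append("clean or overwrite the affected partition")
    if transient:
        steps.append("retry the failed task")
    else:
        steps.append("fix the root cause, then rerun the failed run")
    if fix_affects_past_intervals:
        steps.append("backfill only the affected dates")
    steps.append("validate data before closing the alert")
    return steps


# A brief network timeout that wrote nothing.
print(recovery_plan(True, False, False))

# A logic bug that wrote partial output and also corrupted earlier intervals.
print(recovery_plan(False, True, True))
```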

What to check after a DAG comes back up

Green boxes are not the finish line. Check row counts, partition completeness, freshness, and upstream file arrival. Then verify downstream consumers, such as marts, dashboards, or exports. Close the alert only after the data is trustworthy again.
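
Two of those checks, partition completeness and row counts, are easy to automate. The sketch below is a minimal version with hypothetical function names, dates, and a 5% tolerance chosen only for illustration. In practice, the loaded dates and counts would come from a query against the target table.

```python
from datetime import date, timedelta


def missing_partitions(start, end, loaded):
    """Dates in the inclusive window [start, end] that have no loaded partition."""
    days = (end - start).days + 1
    expected = [start + timedelta(days=i) for i in range(days)]
    return [d for d in expected if d not in loaded]


def row_count_ok(actual, expected, tolerance=0.05):
    """True when actual is within a relative tolerance of the expected count."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) / expected <= tolerance


# Hypothetical state after a recovery: May 3 never landed.
loaded = {date(2026, 5, 1), date(2026, 5, 2), date(2026, 5, 4)}
gaps = missing_partitions(date(2026, 5, 1), date(2026, 5, 4), loaded)

print(gaps)
print(row_count_ok(980, 1000), row_count_ok(700, 1000))
```

Here the run could be green in the UI while May 3 is still missing, which is why the alert should stay open until checks like these pass.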

A recovered DAG is useful only when the recovered data matches the expected business window.

The habits that keep Airflow stable over time

Stable Airflow setups usually look boring, and that is a good thing. Smaller DAGs fail in clearer ways. Isolated tasks make reruns safer. Regular review of failure patterns shows which jobs need better retries, better ownership, or better upstream contracts.

How SLAs and alerts should support real operations

SLAs help teams notice late data early. They should reflect business freshness, not the fastest run you saw once in a test week. A billing pipeline may need tight alerting. A daily internal report may not. Set thresholds by business impact, then route alerts to the people who can act on them.
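
Thresholds set by business impact can live in a small lookup that pairs each pipeline with a maximum data age and an alert route. The sketch below is framework-free and does not use Airflow's own SLA features, which vary by version. The pipeline names, thresholds, and routes are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical pipelines, freshness thresholds, and alert routes.
FRESHNESS_RULES = {
    "billing": {"max_age": timedelta(hours=1), "route": "pager"},
    "internal_daily_report": {"max_age": timedelta(hours=30), "route": "team-channel"},
}


def freshness_alert(pipeline, last_loaded_at, now):
    """Return the alert route if the data is older than allowed, else None."""
    rule = FRESHNESS_RULES[pipeline]
    if now - last_loaded_at > rule["max_age"]:
        return rule["route"]
    return None


now = datetime(2026, 5, 28, 9, 0)
last_load = datetime(2026, 5, 28, 6, 30)  # data is 2.5 hours old

print(freshness_alert("billing", last_load, now))
print(freshness_alert("internal_daily_report", last_load, now))
```

The same 2.5-hour delay pages someone for billing and stays silent for the internal report, so alert volume follows business impact rather than task timing.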

What to document so the next failure is easier to fix

Each critical DAG needs a short runbook. Keep the owner, common failure modes, backfill steps, and validation checks in one place. When the same incident happens again, the team should not have to rediscover the fix.

One-minute summary

Review each critical DAG for idempotent writes before you trust reruns.
Use retries for temporary failures and fail fast on bad logic.
Backfill only the dates that are missing or wrong.
Validate downstream data after every recovery.
Write runbooks and set SLAs that match business deadlines.

Glossary

DAG: A workflow graph that defines task order in Airflow.
Retry: Another attempt after a task fails.
Backfill: Reprocessing scheduled intervals from the past.
Idempotent task: A task that can rerun safely without duplicate results.
Partition: A logical slice of data, often by date.
SLA: A target time for data freshness or task completion.
Runbook: A short guide for handling common failures.
Downstream job: A job that depends on earlier pipeline output.

FAQ

When should I use Airflow backfill?

Use Airflow backfill when historical scheduled intervals are missing or wrong. Common cases include upstream outages, late files, or a code fix that changed past results. Keep the window narrow, run small batches first, and verify the output before expanding the range.

How many retries should an Airflow task have?

Start small. Many tasks only need one to three retries. API calls that fail for temporary reasons may need a few quick retries, while heavy warehouse loads should get fewer retries because each attempt is expensive. If a failure needs code or schema changes, fail fast instead of adding more retries.

What is an SLA in Airflow?

An SLA in Airflow is a time expectation for task or pipeline completion. In practice, it is a freshness promise to the business. Set it around when the data must be ready, not when the task usually finishes on a good day. Useful SLAs create action, not noise.

How do I recover a failed DAG without duplicate data?

Start by checking whether the failed run wrote partial output. If it did, clean or overwrite that partition before rerunning. Then replay only the affected window. Safe recovery depends on idempotent writes, partition-aware logic, and validation checks on row counts, freshness, and downstream results.

Conclusion

Strong Airflow setups do not avoid every failure. They recover in a controlled way, with retries that match the error, backfills that match the business window, SLAs that reflect real deadlines, and checks that prove the data is correct after recovery.

Pick one important DAG this week and document its retry rules, backfill process, SLA, and post-recovery checks. If you want guided practice, the DE Projects Course from Data Engineer Academy walks you through real pipelines that you can debug, rerun, and validate.


Chris Garzon

Christopher Garzon has worked as a data engineer for Amazon, Lyft, and an asset management startup, where he was responsible for building the entire data infrastructure from scratch. He is the author of “Ace the Data Engineer Interview” and has helped hundreds of students break into the data engineering industry. He is also an angel investor, an advisor to multiple startups, and the founder and CEO of Data Engineer Academy.
