
Airflow in Production: Backfills, Retries, SLAs, and Failed DAG Recovery
Production Airflow problems usually come from three places: bad retry settings, risky backfills, and weak recovery plans after failures. The fix is a short set of Airflow production best practices that keep reruns safe, reduce alert noise, and stop duplicate writes.
If your DAGs fail at 2 a.m., the hard part is not clicking “clear” in the UI. The hard part is knowing when to retry, when to rerun, and when to reprocess history without breaking downstream jobs.
Key Points
- Retries should cover short outages, not bad code.
- Backfills should target clear date windows and safe writes.
- Every important DAG needs an owner and a recovery runbook.
- SLAs should reflect real freshness needs, not wishful timing.
- Recovery ends only after the data checks pass.
Quick summary: Stable Airflow depends on controlled retries, narrow backfills, clear SLAs, and recovery steps that verify data after the DAG turns green again.
Key takeaway: Failure is normal in production. Unplanned recovery is what causes duplicate rows, stale dashboards, and on-call chaos.
Quick promise: Apply these habits and you will spend less time babysitting DAGs and more time fixing the few failures that matter.
Start with a production mindset before you touch retries or backfills
Airflow in production is not about getting tasks to run once. It is about predictable behavior every day. A DAG that “usually works” is risky when it feeds finance tables, executive dashboards, or customer exports.
Some failures are obvious, like red tasks and pager alerts. Others hide in plain sight, like duplicate records, missed partitions, and late data that lands after reporting closes. Because of that, each important DAG should have an owner, a runbook, and a rule for what happens when a run fails.
What makes a DAG safe to run in production?
A production-ready DAG has idempotent tasks, explicit dependencies, sensible retries, and outputs you can inspect. If you rerun a task, it should overwrite the right partition or skip work that already finished. It should not insert the same rows twice or break a downstream dbt model.
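To make "overwrite the right partition" concrete, here is a minimal sketch of an idempotent write. It uses an in-memory SQLite database so it runs anywhere; the `daily_sales` table, its columns, and the `overwrite_partition` helper are hypothetical names for illustration, not part of Airflow. The delete and insert happen in one transaction, so a rerun replaces the date slice instead of appending to it.

```python
import sqlite3


def overwrite_partition(conn, ds, rows):
    """Replace all rows for one date partition inside a single transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM daily_sales WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO daily_sales (ds, order_id, amount) VALUES (?, ?, ?)",
            [(ds, order_id, amount) for order_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (ds TEXT, order_id INTEGER, amount REAL)")

rows = [(1, 10.0), (2, 25.5)]
overwrite_partition(conn, "2024-01-01", rows)
overwrite_partition(conn, "2024-01-01", rows)  # simulated rerun of the same task

count = conn.execute(
    "SELECT COUNT(*) FROM daily_sales WHERE ds = ?", ("2024-01-01",)
).fetchone()[0]
print(count)  # still 2 rows after the rerun
```

A warehouse would likely use its own partition overwrite or merge statement, but the property to test is the same: running the task twice leaves the same rows as running it once.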
Why failure handling matters more than perfect schedules
No team gets zero failures. APIs time out, warehouses pause, and files arrive late. Good failure handling protects trust, because people can recover the right window quickly and confirm the data is still correct.
Use retries the right way so temporary issues do not break the pipeline
Airflow retries help with short-lived problems, not broken logic. A retry makes sense when an API call times out, a warehouse connector drops, or a database lock clears after a brief wait. In those cases, one more attempt can save a run and cut noise.
The settings matter. Retry count, retry delay, and exponential backoff should match the task. Small metadata checks can retry quickly. Heavy loads into Snowflake, BigQuery, or Redshift need more caution because each retry costs time and compute.
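As a rough illustration, the two profiles below show how those settings might differ. The keys (`retries`, `retry_delay`, `retry_exponential_backoff`, `max_retry_delay`) are operator arguments in Airflow 2.x; the values and the two dictionary names are assumptions for this sketch, not recommendations, and argument behavior can differ in other Airflow versions.

```python
from datetime import timedelta

# Small metadata checks: cheap to repeat, so retry soon and a few times.
light_task_retries = {
    "retries": 3,
    "retry_delay": timedelta(minutes=1),
}

# Heavy warehouse loads: fewer attempts, longer waits, backoff between them,
# and a cap so the delay cannot grow without limit.
heavy_load_retries = {
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(minutes=30),
}
```

Either dictionary can be passed as `default_args` on a DAG or unpacked into a single operator, so the expensive tasks do not inherit the aggressive settings of the cheap ones.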
Which failures should retry, and which should fail fast?
Retry transient errors such as network timeouts, rate limits, lock conflicts, and brief auth issues. Fail fast for bad SQL, schema drift, missing required files, and code bugs. If the input is wrong or the query is wrong, more retries only delay the real fix.
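The split between "retry" and "fail fast" can be sketched in plain Python. `TransientError`, `should_retry`, and `run_with_retries` are hypothetical helpers written for this example; inside a real Airflow 2.x task, raising `airflow.exceptions.AirflowFailException` is the built-in way to fail without using the remaining retries.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (rate limits, lock conflicts)."""


# Built-in TimeoutError and ConnectionError cover most network hiccups.
TRANSIENT_TYPES = (TimeoutError, ConnectionError, TransientError)


def should_retry(exc):
    """Retry only errors that are likely to clear on their own."""
    return isinstance(exc, TRANSIENT_TYPES)


def run_with_retries(fn, max_attempts=3, delay_seconds=0):
    """Call fn, retrying transient errors and re-raising everything else at once."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if not should_retry(exc) or attempt == max_attempts:
                raise  # bad SQL, missing files, and bugs surface immediately
            time.sleep(delay_seconds)
```

With this policy a `FileNotFoundError` or a `ValueError` from bad input stops on the first attempt, while a timeout gets up to `max_attempts` tries.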
How to set retry timing without creating alert fatigue
Retries should buy time for a system to recover, not hide a broken DAG for hours. Short delays fit small tasks. Backoff helps when contention is likely. Most importantly, alerts should reach the on-call team before the business deadline passes. Otherwise, the DAG looks “busy” while the data is already late.
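One way to sanity-check a retry policy is to add up the worst-case waiting time and compare it with the time left before the deadline. The sketch below assumes simple doubling for backoff; Airflow's real backoff calculation also adds jitter, so treat the result as an approximation. The function name and the 60-minute budget are assumptions for this example.

```python
from datetime import timedelta


def worst_case_retry_wait(retries, retry_delay, max_retry_delay=None, backoff=False):
    """Upper bound on time spent waiting between attempts (excludes task run time)."""
    total = timedelta(0)
    for attempt in range(retries):
        # Simplified backoff: the delay doubles on each retry.
        delay = retry_delay * (2 ** attempt) if backoff else retry_delay
        if max_retry_delay is not None:
            delay = min(delay, max_retry_delay)
        total += delay
    return total


wait = worst_case_retry_wait(
    retries=4,
    retry_delay=timedelta(minutes=5),
    max_retry_delay=timedelta(minutes=30),
    backoff=True,
)
budget = timedelta(minutes=60)  # hypothetical time left before the business deadline
print(wait, wait > budget)  # 5 + 10 + 20 + 30 minutes of waiting exceeds the budget
```

If the waiting time alone is longer than the budget, the on-call team hears about the failure after the data is already late, so either the retries or the alert timing needs to change.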
Treat backfills as controlled reprocessing, not a quick rerun
An Airflow backfill is a way to fill missed intervals or reprocess old data after a fix. It is useful after upstream outages, late-arriving files, or a logic change that affected historical partitions. It is also one of the easiest ways to overload a warehouse or corrupt downstream data if you rush it.
This quick comparison keeps the choice clear:
| Situation | Best move | Reason |
| --- | --- | --- |
| One task failed in one run | Rerun | The time window already exists |
| Several historical intervals are missing | Backfill | You need scheduled history rebuilt |
| Logic changed for old partitions | Backfill | Past output is wrong |
| Bad code still exists | Fix the code first | Reprocessing now repeats the bug |
The main takeaway is simple: match the recovery action to the time window that is wrong.
When should you backfill instead of rerunning a failed task?
Use a rerun when a single task failed and the rest of the run is still valid. Use a backfill when entire historical intervals are missing, partial, or wrong after a code fix. Do not backfill “everything” unless the business impact truly spans all history.
How to backfill without creating duplicate data
Backfills stay safe when writes are idempotent. Partition overwrite patterns work well, because each rerun replaces one logical slice instead of appending again. Run small date batches first, check row counts and freshness, then widen the range. After that, confirm that downstream tables, dashboards, and exports still line up with the corrected window.
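The "small batches first" idea can be sketched as a helper that splits a date range into short inclusive windows. `date_batches` is a hypothetical function for this example; each window would then go to whatever backfill mechanism your Airflow version provides (in Airflow 2.x, for example, the CLI form is `airflow dags backfill --start-date ... --end-date ... <dag_id>`).

```python
from datetime import date, timedelta


def date_batches(start, end, batch_days):
    """Yield inclusive (batch_start, batch_end) windows covering start..end."""
    if batch_days < 1:
        raise ValueError("batch_days must be at least 1")
    current = start
    while current <= end:
        batch_end = min(current + timedelta(days=batch_days - 1), end)
        yield current, batch_end
        current = batch_end + timedelta(days=1)


batches = list(date_batches(date(2024, 1, 1), date(2024, 1, 10), batch_days=4))
for batch_start, batch_end in batches:
    # Run one window, validate row counts and freshness, then move on.
    print(batch_start.isoformat(), "to", batch_end.isoformat())
```

Running the first window alone and checking its output before starting the rest keeps a bad fix from being replayed across the whole range.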
Build a clear plan for failed DAG recovery before the incident happens
DAG failure recovery should be a process, not a guess. Start with the logs, identify the real error, and then decide whether the run needs a retry, a rerun, or a backfill. If partial data already landed, clean that state before you replay the work.
A simple decision flow for recovery: retry, rerun, or backfill?
First, check whether the failure was temporary. If yes, a retry is often enough. Next, check whether the failed run wrote partial data for that window. If yes, clean or overwrite that partition before rerunning. Then ask whether code or input data changed. If the fix applies to past intervals, run a backfill for only those dates.
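The flow above can be written down as a small function, which is a handy starting point for a runbook. This is one interpretation of the steps, not an Airflow feature; the function name, arguments, and the wording of each step are assumptions for this sketch.

```python
def choose_recovery(transient, partial_write, fix_affects_past_intervals):
    """Return the ordered recovery steps for a failed run."""
    steps = []
    if partial_write:
        # Partial output must go before any replay, or the rerun may duplicate rows.
        steps.append("clean or overwrite the affected partition")
    if fix_affects_past_intervals:
        steps.append("backfill only the affected dates")
    elif transient:
        steps.append("retry the failed task")
    else:
        steps.append("fix the root cause, then rerun the failed run")
    return steps


print(choose_recovery(transient=True, partial_write=False, fix_affects_past_intervals=False))
print(choose_recovery(transient=False, partial_write=True, fix_affects_past_intervals=True))
```

Encoding the decision this way forces the team to answer the three questions explicitly instead of clearing tasks and hoping for the best.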
What to check after a DAG comes back up
Green boxes are not the finish line. Check row counts, partition completeness, freshness, and upstream file arrival. Then verify downstream consumers, such as marts, dashboards, or exports. Close the alert only after the data is trustworthy again.
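Two of those checks, partition completeness and row counts, are easy to automate. The sketch below assumes you can already list the loaded partition dates and fetch a row count; `missing_partitions`, `row_count_ok`, and the 5% tolerance are hypothetical choices for illustration.

```python
from datetime import date, timedelta


def missing_partitions(start, end, loaded_dates):
    """Return the dates in start..end (inclusive) that have no loaded partition."""
    expected = {start + timedelta(days=i) for i in range((end - start).days + 1)}
    return sorted(expected - set(loaded_dates))


def row_count_ok(actual, expected, tolerance=0.05):
    """True when the actual row count is within the tolerance of the expected count."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) / expected <= tolerance


gaps = missing_partitions(
    date(2024, 1, 1), date(2024, 1, 3), [date(2024, 1, 1), date(2024, 1, 3)]
)
print(gaps)                    # the January 2 partition is still missing
print(row_count_ok(98, 100))   # within 5% of the expected count
print(row_count_ok(80, 100))   # too far off, keep the alert open
```

Checks like these can run as the last step of the recovery, so the alert closes on evidence rather than on task color.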
A recovered DAG is useful only when the recovered data matches the expected business window.
The habits that keep Airflow stable over time
Stable Airflow setups usually look boring, and that is a good thing. Smaller DAGs fail in clearer ways. Isolated tasks make reruns safer. Regular review of failure patterns shows which jobs need better retries, better ownership, or better upstream contracts.
How SLAs and alerts should support real operations
SLAs help teams notice late data early. They should reflect business freshness, not the fastest run you saw once in a test week. A billing pipeline may need tight alerting. A daily internal report may not. Set thresholds by business impact, then route alerts to the people who can act on them.
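Here is a minimal sketch of setting thresholds by business impact. The pipeline names and time limits are hypothetical, and the check is plain Python rather than Airflow's own SLA mechanism. Airflow 2.x exposes a task-level `sla` argument and a DAG-level `sla_miss_callback`, but support differs between major versions, so check what yours provides.

```python
from datetime import datetime, timedelta

# Hypothetical freshness targets, set by business impact rather than typical run time.
SLA_BY_PIPELINE = {
    "billing_daily": timedelta(hours=1),
    "internal_report": timedelta(hours=6),
}


def sla_missed(pipeline, interval_end, now, finished_at=None):
    """True when the data was not (or is not yet) ready by interval_end + SLA."""
    deadline = interval_end + SLA_BY_PIPELINE[pipeline]
    # A finished run is judged by its finish time, an unfinished one by the clock.
    reference = finished_at if finished_at is not None else now
    return reference > deadline


interval_end = datetime(2024, 1, 2, 0, 0)
now = datetime(2024, 1, 2, 2, 0)
print(sla_missed("billing_daily", interval_end, now))    # two hours late against a one-hour target
print(sla_missed("internal_report", interval_end, now))  # still inside the six-hour window
```

The same delay triggers an alert for billing and stays quiet for the internal report, which is the difference between an SLA that creates action and one that creates noise.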
What to document so the next failure is easier to fix
Each critical DAG needs a short runbook. Keep the owner, common failure modes, backfill steps, and validation checks in one place. When the same incident happens again, the team should not have to rediscover the fix.
One-minute summary
- Review each critical DAG for idempotent writes before you trust reruns.
- Use retries for temporary failures and fail fast on bad logic.
- Backfill only the dates that are missing or wrong.
- Validate downstream data after every recovery.
- Write runbooks and set SLAs that match business deadlines.
Glossary
- DAG: A workflow graph that defines task order in Airflow.
- Retry: Another attempt after a task fails.
- Backfill: Reprocessing scheduled intervals from the past.
- Idempotent task: A task that can rerun safely without duplicate results.
- Partition: A logical slice of data, often by date.
- SLA: A target time for data freshness or task completion.
- Runbook: A short guide for handling common failures.
- Downstream job: A job that depends on earlier pipeline output.
FAQ
When should I use Airflow backfill?
Use Airflow backfill when historical scheduled intervals are missing or wrong. Common cases include upstream outages, late files, or a code fix that changed past results. Keep the window narrow, run small batches first, and verify the output before expanding the range.
How many retries should an Airflow task have?
Start small. Many tasks only need one to three retries. API calls that hit temporary errors may need a few quick retries, while heavy warehouse loads should retry less often because each attempt is expensive. If a failure needs code or schema changes, fail fast instead of adding more retries.
What is an SLA in Airflow?
An SLA in Airflow is a time expectation for task or pipeline completion. In practice, it is a freshness promise to the business. Set it around when the data must be ready, not when the task usually finishes on a good day. Useful SLAs create action, not noise.
How do I recover a failed DAG without duplicate data?
Start by checking whether the failed run wrote partial output. If it did, clean or overwrite that partition before rerunning. Then replay only the affected window. Safe recovery depends on idempotent writes, partition-aware logic, and validation checks on row counts, freshness, and downstream results.
Conclusion
Strong Airflow setups do not avoid every failure. They recover in a controlled way, with retries that match the error, backfills that match the business window, SLAs that reflect real deadlines, and checks that prove the data is correct after recovery.
Pick one important DAG this week and document its retry rules, backfill process, SLA, and post-recovery checks. If you want guided practice, the DE Projects Course from Data Engineer Academy walks you through real pipelines that you can debug, rerun, and validate.

