
Data Quality Tests in SQL: Nulls, Duplicates, Ranges, and Referential Integrity
Data quality tests in SQL help you catch bad rows early, before they break dashboards, audits, or machine learning work. The four checks that matter most are nulls, duplicates, range rules, and referential integrity. They work well in Snowflake, BigQuery, Redshift, and Postgres because the logic stays close to the tables. One null customer_id can turn a revenue-by-customer report into a pile of unattributed sales.
Key Points
- SQL data validation is cheapest when it runs near the warehouse tables.
- Null checks catch missing values in fields that should never be empty.
- Duplicate checks protect counts, revenue totals, and one-row-per-entity models.
- Range checks stop impossible dates, prices, ages, and scores from slipping through.
- Referential integrity checks find orphaned rows before broken joins hide them.
Quick summary: Start with four warehouse tests that find the most common data problems. Then automate them after load, during transformation, and before you publish trusted models.
Key takeaway: The best SQL tests look for bad rows, not good ones. When valid data returns zero failing rows, automation becomes simpler and alerts become clearer.
Quick promise: If you can write basic where, group by, and join queries, you can build practical tests your team can reuse across reports, dbt models, and scheduled jobs.
Why data quality tests in SQL matter for every warehouse
Bad data spreads fast. A single broken table can push wrong totals into dashboards, create empty joins in a semantic layer, and feed noisy features into a model. Because of that, quality checks belong close to the data, not after someone spots a bad chart in a meeting.
What goes wrong when bad data slips through
Missing customer IDs make attribution impossible. Duplicate orders inflate revenue. Ages of 999 or prices below zero pass SQL syntax, but they fail the business rule. Orphaned foreign keys, such as orders with no matching customer, break joins and hide context.
Those failures are easy to miss because tables still load. A report may render fine while the numbers underneath drift farther from reality.
Where SQL checks fit in a modern pipeline
Most teams run checks after ingestion, during transformation, or right before a model or dashboard goes live. In practice, that means dbt tests, scheduled warehouse queries, or custom validation SQL inside orchestration jobs.
This placement matters. When tests run near the source tables, you catch issues before analysts build on them. That makes warehouse tests faster, cheaper, and easier to trace.
The four SQL tests every team should start with
You can write dozens of rules, but these four catch most early failures. They also map cleanly to common SQL patterns, so they work for beginner and senior teams alike.
Null checks that catch missing values early
Null checks protect required columns such as IDs, dates, status fields, and foreign keys. A query as simple as where customer_id is null can save hours of backtracking. You can also compare count(*) with count(customer_id) to spot missing values quickly.
Not every null is bad. Optional notes can stay empty. Primary keys, order dates, and event types usually can’t. When a small amount is acceptable, use a threshold, such as “fail if null rate exceeds 0.5%.”
Duplicate checks that protect unique records
Duplicates show up after retries, bad joins, or late backfills. They distort counts first, then everything built on top of those counts. Revenue tables feel this pain fast.
Use group by with having count(*) > 1 to test uniqueness on one column or a combination. In many cases, the real rule is not “order_id is unique” but “customer_id plus order_timestamp plus source_system should be unique.”
Range checks that keep values realistic
Range checks enforce business sanity. An age between 0 and 120 makes sense. A price greater than zero makes sense. An event date 10 years in the future usually does not.
This matters because valid SQL values are not always valid business values. Hard limits catch impossible data. Soft warnings catch suspicious spikes before they turn into customer-facing errors.
Referential integrity checks that protect joins
Referential integrity checks confirm that child rows have a real parent. Orders should map to customers. Line items should map to orders. Sessions should map to users, if the model expects that relationship.
A common pattern uses a left join from child to parent, then filters where parent.id is null. That finds orphaned rows. Database foreign keys help at write time, but analytics warehouses often load data without strict enforcement, so SQL checks still matter.
How to write simple SQL checks that are easy to reuse
A good test reads like a rule, not a puzzle. Keep each check focused on one failure type and one grain. Then make the output easy for people and schedulers to understand.
A clean pattern for pass or fail queries
The cleanest pattern returns failing rows only. If the query returns zero rows, the test passes. That works well in dbt and in scheduled SQL because the logic is obvious during review.
For example, select orders where order_total < 0, not all orders with a pass flag. Counting bad rows is easier to automate than proving every good row is good.
How to make tests readable for the next person
Name tests after the rule, not the table alone. orders_no_null_customer_id tells the on-call person what broke. Small helper CTEs can improve clarity when the business rule needs a few steps.
Keep logic in one place. Store tests next to transformations or in a dedicated tests folder. Add one short comment if the threshold comes from a real policy or contract.
When to use dbt tests, custom SQL, or database constraints
Different problems need different tools. This quick comparison helps:
| Approach | Best for | Strength | Limitation |
| dbt built-in tests | Nulls, uniqueness, relationships | Fast to set up | Less flexible for custom rules |
| Custom SQL checks | Business-specific validation | Full control | More code to maintain |
| Database constraints | Write-time protection | Strongest guardrail | Not always enforced in warehouses |
The takeaway is simple: use the lightest tool that still enforces the rule.
Built-in tests versus custom checks
Built-in dbt tests are great for common rules and quick coverage. They’re easy to read, easy to schedule, and a solid default for not null, unique, and basic relationship checks.
Custom SQL is better when the rule depends on business logic. Multi-column uniqueness, time-window thresholds, and special join conditions usually need hand-written SQL.
Why some teams still need constraints at the database level
Primary keys, unique constraints, and foreign keys stop bad writes before they land. That is the strongest control you can have. Still, many analytics teams inherit data after the write step, or they ingest from systems they don’t control.
Because of that, separate SQL data validation still matters. Source systems, file loads, and transformations can introduce new errors even when the original app database had strict rules.
How to monitor, alert, and fix failing tests without noise
A failing test is only useful if someone can act on it. Good monitoring ties each check to an owner, a severity level, and a clear next step.
Set thresholds so teams do not ignore alerts
Not every issue needs a hard stop. Some failures should warn, while others should block a publish job. A few nulls in an optional field may deserve a ticket. Missing primary keys should fail the pipeline.
Track failures so the same problem does not keep coming back
Log the test name, failed row count, timestamp, and affected table. Then assign an owner. Over time, repeat failures show patterns, and those patterns usually point to a flaky source feed or a transform bug that needs a real fix.
One-minute summary
- Add null, duplicate, range, and relationship checks to high-risk tables first.
- Write tests that return bad rows, because zero rows is easy to automate.
- Use dbt built-ins for common rules and custom SQL for business logic.
- Apply warning thresholds when small drift is acceptable.
- Track failures by owner so repeat issues lead to permanent fixes.
Glossary
- Null: A missing value in a SQL column.
- Duplicate: More than one row for the same expected key.
- Range check: A rule that limits values to acceptable bounds.
- Referential integrity: A rule that links child rows to valid parent rows.
- Foreign key: A column that points to a row in another table.
- Business key: A real-world identifier, such as email or order number.
- Surrogate key: A generated technical ID, often an integer or UUID.
- dbt test: A reusable data test run within a dbt project.
Start with the tables that matter most
The smartest place to begin is your highest-risk data, usually customer, order, and revenue tables. Add the four core checks there first, then automate them close to the warehouse so bad rows never become trusted outputs.
If you want hands-on practice, work through these tests in a SQL course or tutorial and build them against real tables. That step turns abstract rules into habits your team can trust.
FAQ
What is the simplest SQL data quality test?
A null check is usually the simplest starting point. Query a required column with is null, or compare count(*) with count(column). If the field is an ID, date, or status, any null often means the row is broken.
How do I check duplicates in SQL?
Use group by on the expected unique key and add having count(*) > 1. Check combinations when one column is not enough. For example, order lines may need order_id plus line_number, not either column alone.
Are dbt tests enough for warehouse data quality?
dbt tests cover many common rules well. They are a strong default for nulls, uniqueness, and relationships. Still, business-specific logic, warning thresholds, and custom range rules often need hand-written SQL.
Why do referential integrity checks matter in analytics warehouses?
They protect joins. If a child row has no parent, reports can lose context or drop rows during joins. Warehouses do not always enforce foreign keys, so relationship checks in SQL catch orphaned data before it spreads.
Which tables should I test first?
Start with tables tied to revenue, customers, orders, and core dimensions. Those tables feed the most dashboards and downstream models. When they break, the business feels it fast, so they offer the highest return on early testing.

