
System Design Projects for Data Engineers

The fastest way to grow as a data engineer is to study systems that solve real business problems. These system design projects go beyond theory. They show how teams handled messy data, slow pipelines, legacy platforms, rising costs, and the push for near real-time analytics.

If you’re trying to get better at cloud architecture, medallion design, CDC, or warehouse migrations, these examples give you a practical map. You’ll see how Azure, Databricks, Snowflake, AWS, Kafka, and Power BI fit together in real environments, not just in isolated tutorials.

Key takeaway

  • Real system design projects teach the trade-offs behind cloud migration, data modeling, automation, and analytics at scale.
  • Azure Databricks and Snowflake show up often because they reduce legacy complexity and speed up trusted reporting.
  • The medallion architecture keeps appearing because it helps teams separate raw, cleaned, and business-ready data.
  • CDC, metadata-driven pipelines, and real-time ingestion cut delays and reduce heavy full-table refreshes.
  • Strong projects don’t just move data; they improve trust, cost, speed, and day-to-day productivity.

Why these projects matter if you want to build real data engineering skills

A lot of learning content stops at tool features. That’s useful, but it misses the hard part. Real work starts when data arrives from many sources, business rules change, reporting must stay reliable, and old systems fight back.

That’s why project-based learning matters so much. You start seeing patterns. One company struggles with 47,000 job configurations. Another hits scaling limits with AWS Glue. A retailer loses access to historical data because its old warehouse design never kept raw events. Different industries, same core problem.

Most of these projects come down to a few design questions:

  • Where should raw data land first?
  • How do you clean it without losing traceability?
  • When do you use batch, and when do you need streaming?
  • How do you keep costs under control while data volume grows?
  • How do you make business teams trust the output?

Those are the questions that turn someone from a tool user into a systems thinker.

Project 1: A Fortune 500 healthcare provider rebuilt its data platform on Azure and Databricks

This project started with a hard truth. A large healthcare provider had data everywhere, but trusted insight nowhere. After the migration, the company reported a 30% productivity boost. That number mattered, but the bigger story was how they got there.

The old setup had five major problems:

  • Unstructured data sprawl across 12 data sources and 7 lines of business
  • Too much operational complexity, including 47,000 job configurations, 90-plus projects, and 2,000 workflows
  • High legacy costs from tools like Ab Initio and Netezza
  • Slow change cycles, where adding a source or updating a process took weeks
  • Poor data quality, which made reports hard to trust

In healthcare, that’s a dangerous mix. Delayed or unreliable data doesn’t just slow a team down. It affects decisions that can impact care, operations, and compliance.

How the new Azure architecture worked

The team rebuilt the platform around Azure Data Factory, Databricks Delta tables, Azure Synapse Analytics, and Power BI. Each tool had a clear job.

Azure Data Factory centralized ingestion

Azure Data Factory acted as the intake layer. It connected on-prem databases, legacy systems, APIs, and flat files through one managed pipeline approach.

That mattered for three reasons. First, it removed silos. Second, it reduced manual work. Third, it supported both batch and streaming use cases, which made the platform more future-ready.

Databricks Delta tables organized processing with the medallion model

After ingestion, the data moved through the bronze, silver, and gold pattern. That layered design gave the team traceability and better control over quality.

A simple way to think about it is this:

Layer | Purpose | Business value
Bronze | Store raw data exactly as received | Keeps source history intact
Silver | Clean, enrich, and standardize data | Improves quality and consistency
Gold | Shape data for dashboards and reports | Makes analytics faster and easier

That structure sounds simple, and that’s the point. Good architecture often replaces chaos with a repeatable pattern.
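A minimal sketch of that bronze, silver, gold flow in plain Python (source names, fields, and cleaning rules here are hypothetical; in production these would be Spark jobs writing Databricks Delta tables):

```python
# Sketch of the medallion flow using plain Python dicts.
# In production each step would be a Spark job writing a Delta table.

def to_bronze(raw_events):
    """Bronze: store records exactly as received, plus lineage info."""
    return [{"source": "claims_api", "payload": e} for e in raw_events]

def to_silver(bronze_rows):
    """Silver: clean and standardize; drop rows missing required fields."""
    cleaned = []
    for row in bronze_rows:
        p = row["payload"]
        if p.get("patient_id") and p.get("amount") is not None:
            cleaned.append({
                "patient_id": p["patient_id"].strip().upper(),
                "amount": float(p["amount"]),
            })
    return cleaned

def to_gold(silver_rows):
    """Gold: aggregate into a business-ready shape."""
    totals = {}
    for r in silver_rows:
        totals[r["patient_id"]] = totals.get(r["patient_id"], 0.0) + r["amount"]
    return totals

raw = [
    {"patient_id": " p1 ", "amount": "100.5"},
    {"patient_id": "p1", "amount": 50},
    {"patient_id": None, "amount": 10},   # dropped in silver
]
gold = to_gold(to_silver(to_bronze(raw)))
print(gold)  # {'P1': 150.5}
```

Note that the bad record is only filtered in silver; bronze keeps everything, which is exactly what makes reprocessing possible later.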

Metadata-driven automation removed hardcoded pipeline work

The smarter part of the design came next. Instead of hardcoding workflows for every source, the team built a metadata-driven ingestion framework.

That meant new sources could follow rules instead of custom pipeline logic. It also meant each source used the same transformation standards. As a result, the system became easier to scale and easier to maintain.
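A rough sketch of what metadata-driven ingestion means in practice (the source names, fields, and rules below are invented for illustration, not taken from the case study):

```python
# Sketch: each source is described by metadata instead of custom pipeline code.
# Adding a source means adding an entry here, not writing a new pipeline.

SOURCE_METADATA = {
    "claims_db":   {"format": "jdbc", "load": "incremental", "key": "claim_id"},
    "vendor_feed": {"format": "csv",  "load": "full",        "key": "record_id"},
}

def build_ingestion_plan(source_name):
    """Turn a metadata entry into a concrete list of pipeline steps."""
    meta = SOURCE_METADATA[source_name]
    steps = [f"read:{meta['format']}"]
    if meta["load"] == "incremental":
        steps.append(f"filter:new_rows_by:{meta['key']}")
    steps.append("write:bronze")
    return steps

print(build_ingestion_plan("claims_db"))
# ['read:jdbc', 'filter:new_rows_by:claim_id', 'write:bronze']
```

Every source flows through the same `build_ingestion_plan` logic, so transformation standards stay uniform by construction.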

Synapse and Power BI turned trusted data into decisions

Once gold-layer data was ready, Azure Synapse stored and served it for analytics. Dedicated SQL pools handled complex queries quickly, and Power BI turned the results into dynamic dashboards.

Business teams could track KPIs, spot trends, and move faster because the reporting layer finally matched the speed of the business.

Why Azure made sense here

Azure fit this migration for three practical reasons.

First, integration. The platform connected on-prem systems to the cloud without forcing a full rip-and-replace on day one.

Next, tool fit. Databricks, Synapse, Data Factory, and Power BI worked together as one connected stack.

Finally, cost control. The pay-as-you-go model helped the company move away from expensive legacy infrastructure.

What changed after the migration

The results went beyond one performance metric. Productivity improved by 30%, data quality improved, legacy costs dropped, and the platform could grow with new sources and larger workloads.

When a platform preserves raw data, automates standard transformations, and exposes trusted gold-layer outputs, teams stop arguing about the numbers and start using them.

If you want to practice architectures like this, the Data Engineer Academy coursework is a good place to start.

Project 2: Moving from AWS Glue to Snowflake for faster and simpler data pipelines

This company first built its ETL flow on AWS Glue. That made sense at the start. Glue is a serverless data integration service, and it covers the main ETL building blocks in one ecosystem.

What AWS Glue handled in the original setup

The architecture relied on four core Glue components.

Crawler: This scanned raw data, detected schema, and registered metadata.

Data Catalog: This stored metadata about tables, columns, and file formats so other services could use the data consistently.

Glue Jobs: These ran ETL logic in PySpark or Scala to clean and transform data.

Triggers: These launched jobs on demand, on a schedule, or in response to events such as new files landing in S3.

On paper, that sounds complete. In practice, the setup became harder to manage as volume grew.

Where the old Glue pipeline started to break

The flow looked like this:

  1. Raw data landed in an S3 bucket.
  2. Lambda detected new files.
  3. Glue crawler scanned the data and updated metadata.
  4. Files moved into a named-file S3 bucket for staging.
  5. Glue jobs transformed the staged data.
  6. Cleaned data landed in a final S3 bucket for analytics.

That design worked, but it created heavy operational overhead. Engineers had to manage multiple moving parts, and the batch-heavy design delayed data availability. Three separate S3 buckets also made the architecture more complex than it needed to be.
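The overhead is easier to see when the flow is written out as a chain of hops, one per managed service (bucket and job names are hypothetical; this is a simulation, not real boto3 calls):

```python
# Simulation of the event-driven Glue flow: every new file triggers
# crawl -> stage -> transform -> final landing, one hop per service.

def pipeline(file_key):
    trace = []
    trace.append(f"s3_raw:{file_key}")             # 1. raw landing bucket
    trace.append("lambda:new_file_detected")        # 2. Lambda event
    trace.append("glue_crawler:schema_registered")  # 3. crawler updates catalog
    trace.append(f"s3_staging:{file_key}")          # 4. staging bucket
    trace.append("glue_job:transform")              # 5. PySpark ETL job
    trace.append(f"s3_final:{file_key}")            # 6. analytics bucket
    return trace

hops = pipeline("orders/2024-01-01.json")
print(len(hops))  # 6 hops for a single file
```

Six hops, three buckets, and four services to monitor for every file is the operational overhead the next section removes.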

The system struggled even more when volume reached 500 GB per day.

How Snowflake simplified the design

The migration replaced a multi-step ETL pattern with a much shorter path.

For historical data, the team used Snowflake’s COPY INTO command to bulk load files from S3. For near real-time data, they added Snowpipe, which reacted to S3 events and loaded new files into target tables automatically.

That changed the architecture from “many services glued together” into a simpler pattern:

Sources → S3 → Snowflake ingestion → SQL-based transforms → BI and analytics

A big shift happened here. The logic moved closer to the warehouse, so the team no longer needed a PySpark script for every transformation step.
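The two load paths can be sketched as the SQL each one runs, here built as strings in Python (stage, table, and pipe names are hypothetical; file format and options would vary by setup):

```python
# Sketch: the two Snowflake load paths from the migration.
# Stage, table, and pipe names below are made up for illustration.

def bulk_load_sql(table, stage):
    """One-time historical load with COPY INTO."""
    return f"COPY INTO {table} FROM @{stage} FILE_FORMAT = (TYPE = PARQUET);"

def snowpipe_sql(pipe, table, stage):
    """Continuous load: Snowpipe wraps the same COPY INTO and runs it
    automatically when S3 emits a new-file event."""
    return (
        f"CREATE PIPE {pipe} AUTO_INGEST = TRUE AS "
        f"COPY INTO {table} FROM @{stage} FILE_FORMAT = (TYPE = PARQUET);"
    )

print(bulk_load_sql("sales_raw", "s3_stage"))
```

The notable design point: both paths reuse the same COPY INTO statement, so historical backfill and ongoing ingestion stay consistent.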

The main wins from this migration

The company got several gains at once.

  • Automated workflows: Less manual tuning and fewer moving parts
  • Near real-time availability: Snowpipe loaded new data as files arrived
  • Lower costs: Snowflake’s pay-per-second model reduced wasted compute
  • Simpler design: Fewer intermediate layers and less support overhead

This is one of the clearest examples of a strong system design choice: simplify the path, reduce components, and keep the business closer to current data.

Project 3: A global retailer replaced its legacy warehouse with Azure Databricks

A global retailer with more than 3,000 stores across 40 countries had relied on a SQL Server 2014 enterprise data warehouse for over 15 years. For a long time, it worked. Then growth caught up with it.

What the old retail stack looked like

The legacy environment had several major parts.

The e-commerce side ran on ATG, which handled online orders, apps, marketplaces, and region-specific promotions. In-store operations ran through a DB2 mainframe POS system that supported transactions, loyalty programs, and order management.

IBM DataStage handled ETL into the central warehouse. From there, business teams used Cognos and Tableau for reporting.

That’s a familiar enterprise picture. Different channels fed one central warehouse, and BI tools sat on top. The issue was that the whole system became too rigid.

What broke as the retailer expanded

The retailer faced a set of problems that many older warehouses run into.

On-prem servers were near capacity, so scaling meant buying more hardware. Batch jobs took hours, which delayed reporting. New data sources required custom development, and if raw historical data wasn’t captured early, it was gone for good.

Even documentation had become a problem. After 15 years of changes, many updates were poorly recorded. That made support and troubleshooting slower than they should have been.

How the Databricks medallion architecture fixed those problems

The new architecture centered on Azure Databricks and the medallion pattern.

Bronze stored raw events

The bronze layer kept incoming data exactly as it arrived from sources such as APIs, streaming feeds, and IoT-style retail events. That preserved history and fixed the old problem of losing valuable raw records.

Silver cleaned and normalized data

In the silver layer, the team standardized schemas, removed duplicates, and cleaned regional differences across sales and transaction data. That made the data fit for analysis.

Gold prepared data for the business

The gold layer grouped and summarized data into forms that BI tools could use quickly. Sales by region, time period, or product category became easier to query and visualize.

Automation and validation kept the system healthy

Azure Data Factory orchestrated the movement of data between layers. Spark handled real-time processing that replaced slow batch cycles. A custom validation framework checked schema mismatches and data quality issues, and Azure DevOps supported CI/CD so updates didn’t turn into manual release pain.
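The case study doesn't publish the validation framework itself, but a schema-mismatch check of that kind might look like this minimal sketch (table and column names are hypothetical):

```python
# Sketch: a schema-mismatch and type check, in the spirit of the
# retailer's custom validation framework. Schemas are hypothetical.

EXPECTED = {"sales": {"store_id": int, "amount": float, "region": str}}

def validate(table, rows):
    """Return a list of problems found; empty means the batch is healthy."""
    issues = []
    schema = EXPECTED[table]
    for i, row in enumerate(rows):
        missing = set(schema) - set(row)
        if missing:
            issues.append(f"row {i}: missing {sorted(missing)}")
            continue
        for col, typ in schema.items():
            if not isinstance(row[col], typ):
                issues.append(f"row {i}: {col} should be {typ.__name__}")
    return issues

good = {"store_id": 1, "amount": 9.99, "region": "EU"}
bad = {"store_id": "1", "amount": 9.99}
print(validate("sales", [good, bad]))
```

Running checks like this between layers means a schema drift in one source fails loudly at ingestion instead of silently corrupting gold-layer dashboards.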

Problem-to-solution mapping in the retail migration

Here’s the core shift in one view:

Legacy problem | New design response | Result
On-prem scaling limits | Databricks dynamic scaling | Resources grew with demand
Slow batch reporting | Spark-based real-time processing | Faster insight delivery
Difficult source integration | ADF orchestration | Less custom pipeline work
Lost raw history | Bronze layer storage | Better long-term analysis
Heavy manual updates | CI/CD and automation | Lower support effort

The measured results were strong. Processing became 30% faster, infrastructure costs dropped by 40%, historical data stayed available, and dashboards became far more useful because the data behind them was current.

Project 4: Media and entertainment modernization with AWS, Snowflake, and data contracts

This case focused on a media and entertainment company that wanted a single source of truth. The old environment had too many separate systems, weak collaboration, and data pipelines that took too long to change.

Why the old setup slowed teams down

The legacy architecture relied on AWS Glue plus custom Python scripts to move data from sources like Oracle, MariaDB, and streaming feeds into S3 and Snowflake layers.

The problem wasn’t only the tools. It was the operating model around them. Data pipelines were complex, teams lacked one shared source of truth, and new ingestion patterns could take weeks or even months to add. That meant data engineering moved slower than the business.

What changed in the new state

The future design added several important ideas.

Data contracts gave data producers and consumers shared rules for schema, format, and exchange. That reduced confusion between teams and improved trust in the data.

The company also defined core data products and derived data products. Core products came from the platform team and served broad business needs. Derived products let specific teams, like marketing or finance, shape data for their own use cases.

Metadata and observability improved too. The team used Alation for metadata management and OpenTelemetry for data observability, which made it easier to see how data moved and where issues started.

How the new architecture worked

Batch data moved from source systems through AWS DMS, Kinesis streams, Lambda, and Kinesis Firehose into S3. Streaming data skipped the DMS step and flowed through Kinesis into S3 more directly.

From there, Snowpipe loaded new files into Snowflake with about one minute of latency. Transformations ran through DBT, while UC4 handled scheduling.

The company also organized Snowflake data through a medallion-style pattern:

  • Landing: raw incoming data
  • Bronze: cleaned and flattened data
  • Silver: de-duplicated and versioned data
  • Gold: curated business-ready data

That combination mattered because it improved both speed and team ownership. Business units could work with trusted data products instead of waiting on a central team to interpret every request.
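The silver-layer step above, de-duplication plus versioning, can be sketched in plain Python (record keys and fields are hypothetical; the real pipeline does this in DBT models inside Snowflake):

```python
# Sketch: silver-layer de-duplication and versioning.
# Keep the latest copy of each record, stamped with a version number.

def to_silver(bronze_rows):
    latest = {}
    for row in sorted(bronze_rows, key=lambda r: r["updated_at"]):
        prev = latest.get(row["id"])
        version = prev["version"] + 1 if prev else 1
        latest[row["id"]] = {**row, "version": version}
    return list(latest.values())

bronze = [
    {"id": "a", "updated_at": 1, "plan": "basic"},
    {"id": "a", "updated_at": 2, "plan": "premium"},  # newer duplicate wins
    {"id": "b", "updated_at": 1, "plan": "basic"},
]
silver = to_silver(bronze)
print(silver)
```

Downstream gold models can then trust that each key appears once, while the version column preserves a trace of how often it changed.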

Why this project connects well with data mesh

This case naturally leads into data mesh thinking. The point wasn’t just a new warehouse. It was a change in how teams owned and shared data.

Data contracts helped teams scale because each group could publish data in a consistent way. In other words, central governance stayed in place while domain ownership became stronger.

Project 5: A delivery services provider moved from Postgres to Snowflake for near real-time analytics

This company had a legacy Postgres-based data warehouse that simply couldn’t keep up with the workload. Business users needed faster reporting, but the platform kept falling behind.

What made the old warehouse hard to live with

The old setup had several major issues.

Data ingestion and processing often took hours or even days. Outages were common, and the company reported that 90% of failures were tied to system downtime. ETL jobs required constant manual attention, third-party integrations took too long, and there was no proper change data capture. Every update meant refreshing full tables instead of applying only changes.

That’s expensive, slow, and frustrating. It also blocks growth because each new source adds more pressure to an already weak system.

The new cloud-native pattern

The replacement architecture used Fivetran to extract data from Postgres and land it in Amazon S3. From there, Snowpipe loaded data into Snowflake automatically.

For CDC, the team brought in Kafka, so only changed records needed processing. That reduced full refreshes and cut both time and compute waste. Transformations moved into Snowflake, and curated data marts gave BI and data science teams faster access to usable data.
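The core of CDC is small enough to sketch: apply only the change events instead of reloading the table. The event shape below is hypothetical; in the real pipeline these records arrive via Kafka:

```python
# Sketch: applying CDC change records instead of a full-table refresh.
# Only the changed rows touch the target table.

def apply_cdc(table, events):
    """Mutate the target table dict using only changed records."""
    for e in events:
        if e["op"] in ("insert", "update"):
            table[e["key"]] = e["row"]
        elif e["op"] == "delete":
            table.pop(e["key"], None)
    return table

orders = {"o1": {"status": "placed"}}
changes = [
    {"op": "update", "key": "o1", "row": {"status": "shipped"}},
    {"op": "insert", "key": "o2", "row": {"status": "placed"}},
    {"op": "delete", "key": "o1"},
]
print(apply_cdc(orders, changes))  # {'o2': {'status': 'placed'}}
```

Three small events replaced what the old Postgres setup handled with a full-table refresh, which is where the time and compute savings come from.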

How the migration was executed

The migration followed a careful sequence.

  1. The team assessed the current environment and mapped dependencies.
  2. They used an incremental migration strategy with Fivetran and Kafka CDC.
  3. They validated migrated data against the legacy warehouse.
  4. They cut over to Snowflake and added monitoring and alerting.

That order mattered. Big migrations fail when teams skip validation or move everything at once.

What changed for the business

The results were strong and easy to understand:

  • Near real-time data availability
  • 90% fewer outage-related failures
  • Much less manual pipeline support
  • Faster onboarding of third-party sources
  • Lower cost through auto-scaling and managed compute

This project is a great study in why CDC matters. When only changed records move, the whole system gets lighter.

Project 6: An automotive OEM used AWS to archive and manage ADAS workloads

ADAS, short for Advanced Driver Assistance Systems, produces huge amounts of data. Camera feeds, radar, lidar, simulation outputs, and model inputs add up fast. For one automotive OEM, on-prem storage had become too expensive and too slow.

Why ADAS data creates a different kind of problem

This wasn’t just a warehouse issue. The company dealt with terabytes of data per day, often collected from remote test sites with weak connectivity.

That created three pressure points. First, storage growth was intense. Second, moving data from field locations was difficult. Third, the company needed both high-performance active storage and low-cost archive storage for long-term retention and compliance.

How the AWS architecture handled ingestion, storage, and archive

The AWS design solved the problem in layers.

Ingestion used Snowball and DataSync

For locations without strong internet access, the company used AWS Snowball. Data was copied to the device and shipped for upload.

For ongoing online movement, AWS DataSync handled continuous transfer from on-prem systems to AWS.

Amazon S3 became the storage hub

Once data landed in AWS, Amazon S3 stored and organized it in buckets. Tagging and categorization helped the company find and manage ADAS data more easily.

Tiered archive reduced long-term storage cost

The company placed data into different S3 storage classes based on usage:

Storage tier | Best fit
S3 Standard | Frequently accessed active data
S3 Standard-IA | Data used from time to time
S3 Glacier | Long-term archive and compliance retention

That tiering model is simple, but it matters. Not all data needs fast storage forever.
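The tiering decision is essentially a policy on access recency. A toy version (the day thresholds are hypothetical; real setups express this as S3 lifecycle rules rather than application code):

```python
# Sketch: choosing an S3 storage class by how recently data was accessed.
# Thresholds are made up; production uses S3 lifecycle rules instead.

def storage_class(days_since_access):
    if days_since_access <= 30:
        return "STANDARD"       # hot: frequently accessed
    if days_since_access <= 180:
        return "STANDARD_IA"    # warm: occasional access
    return "GLACIER"            # cold: archive and compliance

print(storage_class(7), storage_class(90), storage_class(400))
# STANDARD STANDARD_IA GLACIER
```

Encoding the policy once means terabytes of daily ADAS data drift automatically toward cheaper tiers as they cool off.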

Processing pipelines kept data useful

AWS processing pipelines handled KPI extraction, sensor log analysis, automatic labeling, and data quality checks. That meant the archived data wasn’t just parked and forgotten. It stayed usable for model training, validation, and compliance work.

The company also used AWS Migration Evaluator and DataSync Discovery to plan the move and reduce migration risk.

Project 7: The big ideas behind these projects, data mesh and lakehouse design

Underneath all of these examples, two ideas show up again and again: domain-based ownership and flexible but governed storage.

Data mesh makes more sense when you picture a food court

A food court is a simple way to understand data mesh.

Each food outlet owns its menu. In the same way, each business domain owns its data. Marketing owns campaign data. Sales owns revenue data. Support owns customer issue data.

The mall still provides shared seating, payments, and rules. That’s the same as self-service platform infrastructure and federated governance in a data mesh model.

Here are the four core ideas:

  • Domain-oriented ownership: Teams own the data closest to their work
  • Self-service infrastructure: The platform gives teams shared tools and standards
  • Data as a product: Data should be clean, documented, and useful to others
  • Federated governance: Rules stay shared even when ownership is spread out

If you’ve ever seen a central data team become a bottleneck, this model starts to feel very practical.

Why the lakehouse model keeps showing up

Data warehousing worked well when most data fit rows and columns. Then unstructured and semi-structured data exploded, and data lakes gave teams cheap storage with more flexibility.

The problem was governance. Many lakes turned into messy dumping grounds.

That’s why the lakehouse pattern gained traction. It combines the storage flexibility of a data lake with warehouse-style structure, quality, and analytics support. It also fits well with the medallion model because raw, cleaned, and curated layers naturally belong in a lakehouse pattern.

How the lakehouse stack changes across clouds

The pattern stays similar even when the tools change.

Platform | Ingestion | Storage | Transformation | Analytics layer
AWS | DMS, Kinesis, Firehose | S3 | Glue, EMR | Redshift
Azure | Data Factory, Stream Analytics | ADLS Gen2 | Databricks, Synapse Spark | Synapse Analytics
GCP | Pub/Sub, Dataflow | Cloud Storage | Dataproc, DBT | BigQuery
Snowflake | Fivetran, Airbyte, Snowpipe | S3, ADLS, or GCS underneath | Snowpark, SQL, DBT | Snowflake

The exact tools change, but the logic stays familiar: ingest raw data, store it cheaply, transform it in stages, and publish trusted outputs for analytics.

Patterns worth copying from these system design projects

When you step back, these case studies repeat a few strong design habits.

Preserve raw data first. Teams that keep a bronze or landing layer can reprocess history later and fix mistakes without losing source records.

Automate pipeline rules. Metadata-driven ingestion and CI/CD reduce support work and cut manual pipeline edits.

Use CDC when freshness matters. Full refreshes waste time and compute. CDC keeps the system lighter.

Separate raw, cleaned, and curated data. The bronze, silver, gold pattern makes debugging easier and reporting more reliable.

Reduce moving parts where you can. A simpler architecture is often a better architecture, especially when support load is already high.

Strong data architecture doesn’t try to look clever. It makes messy systems easier to trust, operate, and extend.

FAQ

What makes a good data engineering system design project?

A good project solves a real business problem with clear trade-offs. It should include source systems, ingestion, storage, transformation, serving layers, and operational concerns like cost, quality, and monitoring. The best projects also show measurable outcomes, such as lower latency, fewer failures, or better reporting speed.

Why do so many teams use bronze, silver, and gold layers?

Teams use bronze, silver, and gold layers because the pattern separates raw data, cleaned data, and business-ready data. That structure improves traceability, quality control, and reuse. It also makes debugging easier because engineers can pinpoint whether a problem started at ingestion, transformation, or reporting.

When should a team consider moving from AWS Glue to Snowflake?

A team should consider that move when Glue pipelines become hard to manage, batch processing creates delays, and data volume keeps rising. Snowflake can reduce operational overhead, support near real-time loading with Snowpipe, and simplify transformations through SQL-based processing closer to the warehouse layer.

Why is change data capture so useful in modern pipelines?

Change data capture is useful because it processes only new or changed records instead of refreshing full tables. That reduces compute cost, cuts latency, and improves pipeline efficiency. CDC also helps teams move closer to real-time reporting without forcing heavy full-load cycles on every update.

Which tools should aspiring data engineers learn first?

Start with SQL, Python, data modeling, and cloud storage basics. After that, learn one orchestration tool, one warehouse, and one distributed processing tool. If you want a guided path, the Data Engineer Academy coursework helps you practice these skills through real-world projects and system design exercises.

Conclusion

These projects all point to the same lesson: great data systems are built around trust, speed, and maintainability, not just tool choice. Whether the stack uses Azure, AWS, Databricks, or Snowflake, the winning patterns stay consistent: preserve raw data, automate transformations, reduce manual support, and publish business-ready outputs quickly.