Blog

Writing from our team. The latest news, insights, and resources.

How to earn rewards by sharing the knowledge!

Referring a friend to something you genuinely believe in is one of the simplest yet most powerful ways to create opportunities. With that in mind, we’re excited to introduce the Data Engineer Academy Referral Program—a way to reward you for sharing the benefits of industry-leading data engineering training with the people you know. We designed...

By: Chris Garzon | November 25, 2024 | 8 mins read

How to host a website on AWS EC2

In today’s digital world, both individuals and businesses need a capable website, and choosing a trustworthy hosting provider is an important step in building one. Amazon Web Services (AWS) EC2 provides strong, scalable infrastructure for hosting websites, making it a solid option for your hosting needs. Step-by-step instructions for how to host...

By: Ninad Magdum | June 17, 2023 | 13 mins read

Semantic Layer for Data Engineers: Metrics, Models, and BI Consistency

A semantic layer in data engineering gives you one shared place to define metrics, business logic, and model relationships, so BI tools show the same numbers everywhere. When two dashboards disagree on revenue or active users, the problem is usually not the chart. The problem is inconsistent logic. That mismatch creates copied SQL, repeated reviews,...

By: Chris Garzon | June 4, 2026 | 9 mins read

Partitioning and Clustering in Warehouses: Performance Without Guesswork

Partitioning and clustering help a warehouse scan less data, which usually means faster queries and lower cost. In plain terms, warehouse partitioning and clustering are table layout choices that improve pruning, not magic fixes for bad SQL or weak models. That matters when dashboards slow down, fact tables keep growing, and cloud bills rise with...

By: Chris Garzon | June 4, 2026 | 9 mins read

SQL MERGE for Data Engineers: Upserts, CDC, and Idempotent Pipelines

SQL MERGE matches incoming rows to existing rows and then updates, inserts, or deletes them in one statement. Data engineers use it to write upsert logic, finish CDC loads, and make repeat runs safe. In data engineering, SQL MERGE helps you keep warehouse tables current without chaining together separate update and insert jobs. It also...

By: Chris Garzon | June 3, 2026 | 10 mins read

Data Quality Tests in SQL: Nulls, Duplicates, Ranges, and Referential Integrity

Data quality tests in SQL help you catch bad rows early, before they break dashboards, audits, or machine learning work. The four checks that matter most are nulls, duplicates, range rules, and referential integrity. They work well in Snowflake, BigQuery, Redshift, and Postgres because the logic stays close to the tables. One null customer_id can...

By: Chris Garzon | June 2, 2026 | 9 mins read

Slowly Changing Dimensions Type 2 with SQL and dbt

A slowly changing dimension type 2 keeps the full history of a dimension row. When a tracked value changes, you close the old row and insert a new one instead of overwriting the past. That matters in analytics because you often need to know what was true on a given date. SQL handles the change...

By: Chris Garzon | June 1, 2026 | 9 mins read

Incremental Data Models in dbt: Append, Merge, and Snapshot Strategies

dbt incremental models load only new or changed rows, so you don’t rebuild a full table on every run. That makes pipelines faster, lowers warehouse cost, and helps large tables stay fresh. In practice, most teams choose between three patterns: append for immutable data, merge for rows that change, and snapshots for history. The right...

By: Chris Garzon | May 31, 2026 | 9 mins read

Common Mistakes in a Snowflake Real-Time Project

Most Snowflake real-time projects fail for a simple reason: teams move too fast, skip planning, and treat streaming data like batch data with shorter timing. That works in a demo. It falls apart in production, where late events, duplicates, bad timestamps, and recovery gaps show up fast. If you’re building one of these pipelines, you...

By: Chris Garzon | May 30, 2026 | 9 mins read

CDC Pipelines Explained: Debezium, Kafka, and Warehouse MERGE Patterns

A CDC pipeline captures row changes in a source database, publishes those changes as events, and applies them to a warehouse table. Instead of reloading full tables, it moves only inserts, updates, and deletes. If you’re learning about CDC pipelines in data engineering, this is one of the clearest patterns to understand because it shows how modern...

By: Chris Garzon | May 29, 2026 | 10 mins read

Building a Career in Data Engineering with AI Specialization

Are you considering a switch to data engineering and wondering how AI might fit in? You’re not alone. As AI technologies surge in popularity, the demand for skilled data engineers is rising in tandem. In fact, data engineering roles are projected to grow by 21% by 2028, adding hundreds of thousands of positions. This growth...

By: Chris Garzon | May 29, 2026 | 19 mins read

Terraform for Data Engineers: When Infrastructure as Code Matters

Terraform matters when your data stack needs repeatable setup, fewer manual mistakes, and easier teamwork. In data engineering, Terraform helps you create storage, warehouses, permissions, and network pieces with code instead of console clicks. It starts to matter when pipelines move past one-off scripts. Once you have shared environments, cloud complexity, or audit needs, manual...

By: Chris Garzon | May 28, 2026 | 8 mins read

OpenLineage and Marquez: Data Lineage for Modern Pipelines

OpenLineage gives teams a standard way to track lineage across tools, and Marquez gives them an open-source place to store and view that lineage. If you’re trying to make OpenLineage data lineage useful in a real platform, this pairing is one of the clearest options. It matters because modern pipelines cross schedulers, SQL models, Spark...

By: Chris Garzon | May 28, 2026 | 9 mins read

Airflow in Production: Backfills, Retries, SLAs, and Failed DAG Recovery

Production Airflow problems usually come from three places: bad retry settings, risky backfills, and weak recovery plans after failures. The fix is a short set of Airflow production best practices that keep reruns safe, reduce alert noise, and stop duplicate writes. If your DAGs fail at 2 a.m., the hard part is not clicking “clear”...

By: Chris Garzon | May 28, 2026 | 8 mins read

Batch vs Streaming vs Micro-Batch: How Data Engineers Choose the Right Pattern

Data engineers choose batch, streaming, or micro-batch by matching the pipeline to business timing. The batch vs streaming data pipeline decision depends on latency, cost, data volume, and what the team can support. Batch runs on schedules, streaming handles events as they arrive, and micro-batch groups small bursts every few seconds or minutes. The best...

By: Chris Garzon | May 28, 2026 | 8 mins read

Snowflake Real-Time Project With Streams and Tasks

You don’t need a heavy streaming stack to get fresh analytics. In many teams, Snowflake Streams and Tasks are enough to move new data through a pipeline every few minutes. That makes this a great project for beginners in data engineering, analysts moving into ELT, and job seekers building a portfolio. New rows land in...

By: Chris Garzon | May 28, 2026 | 11 mins read