
Build a Real-World DBT Project: Step-by-Step Guide
Building a real-world data pipeline may seem daunting, especially for beginners. But with the right tools and a clear roadmap, you can set up a functioning end-to-end project on your own Windows machine. In this guide, we’ll walk you through the process of using VS Code, Docker Desktop, and WSL (Windows Subsystem for Linux) to create a comprehensive data pipeline. You’ll learn to spin up PostgreSQL in Docker, manage Python environments with pyenv in WSL, and configure dedicated environments for dbt and Airflow. We’ll cover how to initiate a dbt project (with well-organized models for sources, staging, and marts) and how to schedule ETL jobs in Airflow that ingest data from an external weather API. Along the way, we’ll also share troubleshooting tips for common errors, editing bash profiles, and dealing with YAML configuration. By the end, you’ll have a mini “real-world” project running – and the confidence to expand on it.
Seeing is believing: Curious About Real Results? Check out some of our Data Engineer Academy student success stories to see how these principles translate into real job offers.
Overview of the Tech Stack
To build our pipeline, we’ll use a combination of industry-standard tools. Each tool in the stack has a specific role:
Tool/Component | Purpose in the Pipeline |
---|---|
WSL (Windows Subsystem for Linux) | Provides a Linux environment on Windows for running CLI tools, Python, etc., in a native-like way. |
VS Code + Remote WSL | Code editor for writing SQL models, Python scripts, etc., with seamless access to the WSL filesystem. |
Docker Desktop | Runs Docker containers on Windows (using WSL2 backend) for services like PostgreSQL (our database). |
PostgreSQL (Docker container) | Acts as our analytics data warehouse – storing raw and transformed data. Running it in Docker makes setup and teardown easy. |
pyenv + pyenv-virtualenv | Manages multiple Python versions and virtual environments. We’ll use this to isolate dbt and Airflow in separate environments, avoiding dependency conflicts. |
dbt (Data Build Tool) | Handles data transformations. We will use dbt Core (CLI) to organize SQL queries that transform raw data into cleaned, usable tables in Postgres. |
Apache Airflow | Orchestrates the workflow. Airflow will schedule and run tasks: e.g., fetching data from the API and kicking off dbt to transform that data, all on a schedule. |
External Weather API | Serves as the data source for our ETL. Airflow will fetch data from this API (for example, daily weather metrics) to ingest into our database. |
Figure: High-level data pipeline architecture. Airflow (left) extracts data from an external source (e.g., a weather API) and loads it into a PostgreSQL data warehouse (center). DBT then transforms the raw data in the warehouse into refined tables (right) for analysis or reporting.
Before diving into setup, ensure you have Docker Desktop installed (and running) on your Windows PC and have enabled WSL2 integration. Also, install VS Code with the Remote – WSL extension so you can open your WSL environment in VS Code for a smoother experience. We assume you have set up WSL (e.g., Ubuntu distribution) and updated it to work with Docker Desktop. With these in place, let’s start building!
Environment Setup on Windows (WSL, VS Code, Docker)
1. Prepare WSL and VS Code: Open VS Code and connect to your WSL instance (e.g., Ubuntu) via the Remote WSL extension. This gives you a Linux terminal inside VS Code. Update packages (`sudo apt update && sudo apt upgrade`) and ensure basic tools are installed. This WSL environment is where you’ll execute most commands.
2. Install Docker Desktop: Docker Desktop for Windows allows you to run Linux containers. In Docker settings, enable the WSL2 backend and integrate it with your Ubuntu WSL distribution. After installation, you should be able to run `docker` commands from the WSL terminal. Test it by running `docker --version`. If WSL is properly set up with Docker Desktop, Docker commands in WSL will execute against the Docker Desktop engine.
3. Folder Structure: Create a project directory in WSL to keep things organized. For example, in your WSL home directory, you might create a folder `projects/real_world_dbt_demo` (you can name it as you like). Within it, we will add subfolders for the different components (database, dbt, Airflow, etc.) as we proceed. Organizing code in a dedicated project folder is a common practice to keep things tidy.
Setting Up PostgreSQL in Docker
We’ll use PostgreSQL as our database. Instead of installing it manually, we can run Postgres in a Docker container for simplicity. Docker containers provide isolated environments that are easy to start or remove, much like running an app without permanently altering your system.
1. Create a Docker Compose file: In your project directory (e.g., `real_world_dbt_demo`), create a folder for the database setup, say `postgres/`. Inside it, create a file named `docker-compose.yml`. This YAML file will define a Postgres service. For example, you can add the following content:
```yaml
version: '3'
services:
  db:
    image: postgres:14-alpine
    ports:
      - "5432:5432"
    environment:
      POSTGRES_DB: demo_dw
      POSTGRES_USER: demo_user
      POSTGRES_PASSWORD: demo_pass
```
This configuration tells Docker to download the official Postgres image (version 14 on Alpine Linux) and run it. We map the container’s port 5432 to port 5432 on the host (so that WSL can access Postgres via `localhost:5432`). We also set some environment variables to initialize the database: it will create a database named `demo_dw` and a user `demo_user` with the password `demo_pass`. Feel free to change the names/password, but remember them for configuring dbt later.
2. Launch the Postgres container: Open a WSL terminal in the `postgres/` directory and run:

```bash
docker-compose up -d
```
This will start the Postgres database in the background (detached mode). The first time, Docker will pull the Postgres image, which may take a few moments. Once it finishes, verify that the container is running with `docker ps` (you should see a container listed for the Postgres image). By default, the database will listen on port 5432.
3. Test the connection: You can use the `psql` client or any DB tool to test connecting to `localhost:5432` with the credentials from the compose file (user: `demo_user`, password: `demo_pass`, database: `demo_dw`). If you’re working in WSL, you may need to install the `psql` client first (`sudo apt install postgresql-client`) to test from the terminal. This step isn’t strictly necessary now, but it’s a good sanity check that your database is up. We’ll formally test connectivity through dbt in a later step.
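If you prefer to check from Python instead of `psql`, a minimal sketch like the following works too (assuming you have `psycopg2-binary` installed via pip in whichever environment you run it from, and the demo credentials from the compose file above):

```python
# Quick sanity check that the Dockerized Postgres is reachable from WSL.
# Assumes: pip install psycopg2-binary, and the demo credentials from docker-compose.yml.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="demo_dw",
    user="demo_user",
    password="demo_pass",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])  # e.g. "PostgreSQL 14.x ..."
conn.close()
```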
Managing Python with pyenv in WSL
Next, we’ll set up Python and virtual environments for our two main Python-based tools, dbt and Airflow. We’ll use pyenv to manage Python versions, and pyenv-virtualenv to create isolated environments for each tool. This lets us avoid version conflicts (for instance, specific versions of Airflow and dbt might require different dependency versions, so isolating them prevents clashes).
1. Install pyenv and pyenv-virtualenv: In your WSL terminal, install the prerequisites for building Python (if not already installed): `sudo apt install -y build-essential libssl-dev zlib1g-dev libreadline-dev libbz2-dev libsqlite3-dev`. Then install pyenv. One easy method on Ubuntu is the pyenv-installer script; another is Homebrew. For example, if you have Homebrew in WSL:
```bash
# Install Homebrew if not installed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Use Homebrew to install pyenv and pyenv-virtualenv
brew install pyenv
brew install pyenv-virtualenv
```
Alternatively, use the pyenv installer script from the pyenv GitHub, which sets everything up. Either way, once installed, you need to integrate pyenv with your shell.
2. Update your bash profile: Add the following lines to your shell profile so that pyenv activates each time you open WSL. If you’re using the default Bash shell in Ubuntu, edit `~/.bashrc` (open it in VS Code or use `nano ~/.bashrc`). Append these lines:
```bash
export PATH="$HOME/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
```
This ensures the `pyenv` command and its shims are on your PATH, and it initializes pyenv and the virtualenv plugin on shell startup. Save the file and restart your terminal (or run `source ~/.bashrc`) to apply the changes. You can test by running `pyenv -v` – it should output a version, confirming pyenv is ready.
3. Install Python via pyenv: With pyenv, you can install specific Python versions easily. For this project, either Python 3.9 or 3.10 is a good choice (both dbt and Airflow support these versions as of 2025). For example, install Python 3.10.x (we’ll use 3.10.4 in this example):
```bash
pyenv install 3.10.4
```
This will download and compile Python 3.10.4. It may take a few minutes. Once done, you have that Python version available under pyenv.
4. Create virtual environments: Now we’ll create two separate virtual environments using this Python version – one for dbt and one for Airflow. We’ll use pyenv-virtualenv for convenience:
- Create a virtualenv for dbt (let’s call it `dbt_env`):

  ```bash
  pyenv virtualenv 3.10.4 dbt_env
  ```

- Create a virtualenv for Airflow (call it `airflow_env`):

  ```bash
  pyenv virtualenv 3.10.4 airflow_env
  ```
This gives us two isolated environments, both running Python 3.10.4 under the hood. Now we can associate these environments with folders for auto-activation. First, create subfolders in your project directory for the dbt and Airflow config (if you haven’t already): `mkdir -p ~/projects/real_world_dbt_demo/dbt` and `mkdir -p ~/projects/real_world_dbt_demo/airflow`. Then, inside each, set the local pyenv version:
```bash
# For the dbt folder:
cd ~/projects/real_world_dbt_demo/dbt && pyenv local dbt_env

# For the airflow folder:
cd ~/projects/real_world_dbt_demo/airflow && pyenv local airflow_env
```
By doing this, whenever you `cd` into the `dbt` folder, pyenv will automatically activate the `dbt_env` environment, and similarly for the `airflow` folder with `airflow_env` (`pyenv local` writes a small `.python-version` file that pyenv reads when you enter the folder). This is a handy trick to avoid manually activating environments each time.
Installing and Configuring dbt (Data Build Tool)
With our Python environment ready, let’s set up dbt in the `dbt_env`. Make sure your terminal is in the `~/projects/real_world_dbt_demo/dbt` directory so that `dbt_env` is activated (your prompt might indicate the active pyenv). You can confirm by running `pyenv version` – it should show `dbt_env` if the environment is active.
1. Install dbt: dbt consists of a core package plus adapters for specific databases. Since we’re using Postgres, install the `dbt-postgres` package (which includes dbt Core and the Postgres adapter). Run:
```bash
pip install dbt-postgres
```
This will install dbt and its dependencies. (Optional: to pin a specific version, you could append `==<version>`.) If the install fails with an error about `psycopg2` or a missing `libpq-dev`, it means the Postgres client libraries weren’t available to build the adapter. On Ubuntu, the fix is to install the dev package (`sudo apt install libpq-dev`) and then rerun the `pip install`. Once installation succeeds, you can verify by running `dbt --version`.
2. Initialize a dbt project: A dbt project is a directory with a specific structure where your models (SQL files) and configuration live. Let’s create one. Still in the `dbt` directory, run:
```bash
dbt init
```
This command will prompt you for some information to set up the project. It typically asks for a project name, the target database type, and connection details. Provide a project name (e.g., `weather_demo`), choose Postgres when prompted for the adapter, and then it may ask for connection info (if it doesn’t, we’ll set it up manually in a moment). If prompted, enter the Postgres connection details we set up earlier:
- host: `localhost`
- port: `5432` (or whichever port you mapped in the compose file)
- user: `demo_user`
- pass: `demo_pass`
- dbname: `demo_dw`
- schema: `public` (or you can use dedicated schemas like `dev` and `prod` for different targets)
- threads: 1 (you can keep the default for now)
The init process will create a new folder (named after your project, e.g., `weather_demo/`) inside the current directory with some sample files and folders. Your project directory might now look like this:
```
real_world_dbt_demo/
├── dbt/
│   └── weather_demo/
│       ├── dbt_project.yml
│       ├── models/
│       │   └── example/      # example models provided by dbt init
│       ├── logs/
│       ├── snapshots/
│       └── ...
└── postgres/
    └── docker-compose.yml
```
The `dbt_project.yml` file defines the base settings for your dbt project (name, paths, etc.), and the `models/` directory is where you will write SQL models. The example models (like `my_first_dbt_model.sql`) are just placeholders; we’ll reorganize these soon.
If `dbt init` did not prompt for database credentials (newer versions sometimes skip the interactive profile setup), you’ll need to configure the connection manually. dbt uses a `profiles.yml` file (by default located in `~/.dbt/`) to store connection info. You can create or edit this file: for example, open `~/.dbt/profiles.yml` and add a profile matching your project name. It might look like:
```yaml
weather_demo:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: demo_user
      pass: demo_pass
      dbname: demo_dw
      schema: public
      threads: 1
```
This profile is named `weather_demo` (matching the project) and defines a connection target “dev” using our Postgres credentials. Now, when you run dbt commands in the project, dbt knows how to connect to the database.
3. Test the dbt connection: Navigate into the project directory (`cd weather_demo`) and run:
```bash
dbt debug
```
This will check that the profiles file is found and that it can connect to the database. You should see checks passing (OK) and a final “Connection test: [OK]” if all is well. If there are issues:
- If it says profile not found or invalid, ensure the profile name in profiles.yml matches your project name.
- If the connection fails, double-check that the container is running and the creds/port are correct.
4. Organize dbt models into layers: A real-world dbt project is typically organized into layers of models. We will follow a common best practice of using three layers in the `models` directory: sources, staging, and marts (the latter sometimes called production or analytics models).
- Sources layer: This is where we define our external data sources (usually just as YAML definitions, not actual SQL). In our case, the external source is the Weather API data that we will be loading into a raw table in Postgres. We can represent that in dbt as a source. Create a folder `models/sources/`. Inside it, create a YAML file (e.g., `weather_api_sources.yml`) where we declare the source. For example:
```yaml
version: 2

sources:
  - name: weather_api          # source name
    tables:
      - name: daily_weather    # this will correspond to a table in the database
        description: Raw weather data fetched from the API, loaded via Airflow.
```
This tells dbt to expect a source named “weather_api” with a table “daily_weather”. Later, we’ll make sure Airflow creates and populates a table in Postgres named `daily_weather`. By declaring it as a source, we can reference it in our models and use dbt’s testing and documentation features on it.
- Staging layer: Staging models are where we do light, one-to-one transformations. Create a `models/staging/` folder. For each source table, we might have a staging model. For example, we can create `models/staging/stg_weather.sql`, which selects and cleans data from the `weather_api.daily_weather` source. A simple example of such a model might be:
```sql
{{ config(materialized='table') }}

select
    date,
    city,
    temp_kelvin,
    -- Convert temperature to Celsius for easier understanding
    temp_kelvin - 273.15 as temp_celsius,
    humidity,
    description
from {{ source('weather_api', 'daily_weather') }}
```
Here we use `{{ source('weather_api', 'daily_weather') }}` to pull from the raw source table. In this staging layer we might convert temperature from Kelvin to Celsius, select only the columns we need, and so on. Staging models should focus on a single source and prepare the data for further modeling. We also specify `materialized='table'` so the model materializes as a physical table (for simplicity; by default, dbt creates views).
- Marts layer: Marts (or data marts) are the final models that are business-facing or ready for analytics. They might join multiple staging models or aggregate data to form useful tables. Create `models/marts/`. For instance, if our goal is a summary of daily weather by city, we could have a model `models/marts/weather_summary.sql` that aggregates data from `stg_weather`. Example:
```sql
{{ config(materialized='table') }}

select
    city,
    date,
    ROUND(AVG(temp_celsius)::numeric, 2) as avg_temp_celsius,
    MAX(humidity) as max_humidity,
    MIN(humidity) as min_humidity
from {{ ref('stg_weather') }}
group by city, date
```
This uses `{{ ref('stg_weather') }}` to reference the staging model (dbt will ensure dependencies run in the right order). It calculates the average temperature and the humidity range per city per date.
The exact SQL isn’t important for our guide – the key is understanding the structure:
sources (YAML definitions) → staging models (one per source table, basic cleaning) → mart models (business logic, joining or aggregating staging models). Adopting this layered approach makes your project easier to maintain and test.
Don’t forget to also create corresponding YAML files to document and test your staging and mart models (a common convention is one schema.yml per folder). For example, in `models/staging/`, create a `schema.yml` that defines tests (such as uniqueness or not-null tests on key columns) for `stg_weather`. These YAML files improve project documentation and data quality checks, and they are a common source of errors if the formatting is off. Always pay attention to YAML indentation and syntax – a missing space can break things.
5. Update dbt_project.yml (if needed): By default, `dbt_project.yml` may have model configurations for certain subpaths (the starter project configures the `example` directory). You can edit `dbt_project.yml` to include your new folders, or rely on the defaults, which pick up all models in the project. For instance, you might ensure it has:
```yaml
models:
  weather_demo:
    +materialized: view   # default materialization
    staging:
      +materialized: table
    marts:
      +materialized: table
```
This would materialize models in staging and marts as tables by default, if you want that. The step is optional for functionality, but it shows how you can tweak settings globally.
6. Run dbt models: Now that we have organized models (and after Airflow loads some data), you can execute `dbt run` to build the models and `dbt test` to run tests. Initially, since the source table may be empty or not yet loaded, these runs won’t do much, but setting this up now means that once data is there, `dbt run` will create the staging and mart tables accordingly. You can also use `dbt seed` if you have seed CSVs (not in our case) or `dbt docs generate` to build documentation.
At this stage, the core of our dbt project is ready. We have a database and a transformation pipeline defined in dbt; next, we want to automate data ingestion and run those transformations on a schedule.
Scheduling ETL with Airflow
Now comes the orchestration layer: Apache Airflow will coordinate when to fetch new data and when to run dbt. We’ll use the second Python environment, `airflow_env`, for this. Activate it by switching to the `~/projects/real_world_dbt_demo/airflow` directory (pyenv should auto-activate the environment, or use `pyenv activate airflow_env`). Make sure you’re no longer in the dbt directory so you don’t accidentally install Airflow into the wrong environment.
1. Install Airflow: Airflow can be installed via pip. It’s recommended to pin a version and use constraint files because Airflow has many dependencies. For example, to install Airflow 2.6.3, you could run:
```bash
pip install "apache-airflow==2.6.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.6.3/constraints-3.10.txt"
```
The `--constraint` URL ensures that the exact versions of sub-dependencies compatible with Airflow 2.6.3 and Python 3.10 are used. (If this looks scary, it’s just a requirements file provided by the Airflow maintainers – including it avoids a lot of installation headaches.) The command might take a while as it installs Airflow and all of its dependencies.
Once done, verify by running `airflow version`. If it prints a version number, you’ve got Airflow installed in this environment.
2. Set Airflow home (optional): By default, Airflow uses `~/airflow/` as its home directory (where it stores its config, the default SQLite database, etc.). You can leave it as the default, or explicitly set the `AIRFLOW_HOME` environment variable. For example, to use the project’s airflow folder as home, you could run `export AIRFLOW_HOME=~/projects/real_world_dbt_demo/airflow/` in your terminal before initializing Airflow. You might also add this line to your `~/.bashrc` so it persists. In our case, it’s fine to use the default (`~/airflow` in your WSL home) to keep things simple.
3. Initialize and run Airflow: Airflow needs to initialize a metadata database and some default configs. The easiest way (for Airflow 2.2 and above) is to run the standalone command:
```bash
airflow standalone
```
This will do a quick init (set up a SQLite db for Airflow’s metadata and create a default user) and start the Airflow web server on localhost:8080. After a few seconds, you should see logs saying Airflow is ready, and it will print the URL and the login credentials. By default, the username is admin and the password is an auto-generated alphanumeric string shown in the log. Leave this terminal running – it is running the Airflow web server and scheduler.
Open a web browser in Windows (or use VS Code’s browser if available) and go to `http://localhost:8080`. You should see the Airflow UI. Log in with the credentials from the log. If everything is correct, you’ll see the Airflow Dashboard with some example DAGs (which you can ignore or turn off).
4. Create a DAG for our pipeline: Now it’s time to write the Airflow workflow (DAG). Our DAG will do two main things: (a) call the Weather API and load data into Postgres, (b) trigger a dbt run to transform that new data.
Airflow DAGs are defined in Python scripts. In your Airflow home (which by default is `~/airflow/`), there should be a folder called `dags/`. If not, create it (`mkdir ~/airflow/dags`). Inside it, create a file named `weather_dbt_pipeline.py` – you can do this in VS Code for ease of editing, just make sure it gets saved into the WSL `~/airflow/dags` directory (or the dags folder under your AIRFLOW_HOME).
For our example, here’s an outline of a simple DAG:
```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta
import requests
import psycopg2

# DAG definition
default_args = {
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG('weather_dbt_pipeline',
         start_date=datetime(2023, 1, 1),
         schedule_interval='@daily',
         default_args=default_args,
         catchup=False) as dag:

    def fetch_weather_data():
        """Python callable to get weather data from the API and insert it into Postgres."""
        # Example API call (replace with your real API and params)
        resp = requests.get(
            "https://api.open-meteo.com/v1/forecast"
            "?latitude=52&longitude=4.9&daily=temperature_2m_max&timezone=auto"
        )
        data = resp.json()

        # Connect to Postgres and insert data (simplified example)
        conn = psycopg2.connect(host="localhost", port=5432, dbname="demo_dw",
                                user="demo_user", password="demo_pass")
        cur = conn.cursor()

        # Open-Meteo returns parallel lists of dates and values under "daily";
        # adjust the parsing to match the structure of whichever API you use.
        for day, temp_c in zip(data["daily"]["time"], data["daily"]["temperature_2m_max"]):
            cur.execute(
                "INSERT INTO daily_weather (date, temp_kelvin) VALUES (%s, %s)",
                (day, temp_c + 273.15),  # store temperature in Kelvin
            )
        conn.commit()
        cur.close()
        conn.close()

    fetch_task = PythonOperator(
        task_id='fetch_weather',
        python_callable=fetch_weather_data
    )

    dbt_task = BashOperator(
        task_id='run_dbt_models',
        bash_command="cd ~/projects/real_world_dbt_demo/dbt/weather_demo && dbt run"
    )

    fetch_task >> dbt_task
```
Let’s break down what this DAG does:
- It uses a PythonOperator (`fetch_task`) to execute the Python function `fetch_weather_data`. In that function, we use the `requests` library to call a weather API (the example uses the free Open-Meteo API – you could use any weather API of your choice). We then parse the JSON and insert records into the `daily_weather` table in Postgres using `psycopg2`. We keep it simple: for each day in the response, we insert a new row. In practice, you’d handle duplicates or updates accordingly, but as a beginner demo, we assume a fresh insert each run. Make sure to adjust the parsing to match your API’s actual response structure, and note that some services require an API key (store it securely if so). The insert also assumes the raw `daily_weather` table already exists – see the sketch after this list for one way to create it.
- The second task is a BashOperator (`dbt_task`) that runs the dbt command line. We `cd` into the dbt project directory and run `dbt run`, which executes our models (staging and marts) to transform the newly inserted data. We rely on the fact that `dbt_env` is active for that directory. If your Airflow environment does not have dbt installed, this command will fail. One approach, if you keep the environments separate, is to activate the dbt virtualenv within the Bash command. For example, you could use `bash_command='cd ... && eval "$(pyenv init -)" && pyenv activate dbt_env && dbt run'` to explicitly load pyenv and the other environment. This is a bit advanced, but it shows that you can activate the dbt environment inside the Airflow task so it has access to the `dbt` executable. Alternatively, installing `dbt-postgres` in the Airflow environment as well (provided the versions are compatible) is a simpler route for a small project.
- We set `fetch_task >> dbt_task` to establish that the fetch must succeed before the dbt task runs. So each day when the DAG triggers, it will fetch new data and then run the transformations.
- The DAG is scheduled to run daily (`schedule_interval='@daily'`), and we set `catchup=False` to avoid back-filling old dates. You can adjust the schedule as needed (e.g., hourly or once a week). The `start_date` is set in the past (January 1, 2023, in this example) so that the scheduler can start running it immediately (Airflow will not run a DAG for dates before the start_date).
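As noted in the first bullet, the INSERT in `fetch_weather_data` assumes the raw `daily_weather` table already exists. Here is a minimal, hypothetical sketch for creating it up front (the column list mirrors the staging model from earlier and is our assumption, not something the guide prescribes; run it once against `demo_dw`, or fold the `CREATE TABLE IF NOT EXISTS` into the fetch function itself):

```python
# One-off helper to create the raw landing table used by the DAG and the dbt source.
# Assumes psycopg2 is installed and the demo credentials from docker-compose.yml.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS daily_weather (
    date         date,
    city         text,
    temp_kelvin  double precision,
    humidity     integer,
    description  text
);
"""

conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="demo_dw", user="demo_user", password="demo_pass",
)
with conn, conn.cursor() as cur:   # the connection context manager commits on success
    cur.execute(DDL)
conn.close()
```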
Save this file and wait a minute or two. Airflow’s scheduler (running as part of `airflow standalone`) should automatically detect the new DAG file. Refresh the Airflow UI in your browser – you should now see a new DAG named weather_dbt_pipeline (or whatever `dag_id` you gave it). If it isn’t appearing, check the Airflow scheduler logs; there might be a syntax error in the file or a missing dependency (for example, make sure you installed the `requests` and `psycopg2` libraries in the Airflow environment via pip, since the DAG code uses them). If there’s an error, the DAG will show as broken (in red) in the UI.
Once the DAG is visible and not broken, you can trigger it manually in the Airflow UI (turn it on, then click “Run” for an immediate run). Monitor the tasks: the fetch task should run first, followed by the dbt task. If all goes well, the fetch task will populate the `daily_weather` table in Postgres, and then dbt will build the staging and mart models from that data. You can connect to Postgres and query the tables to verify: you should see data in `daily_weather` (the raw table from the API), `stg_weather` (transformed), and `weather_summary` (aggregated results), assuming you followed the model examples. The Airflow UI will show the DAG run as successful (green) if both tasks succeed.
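If you’d rather verify from Python than from a SQL client, a quick sketch like this (assuming the same demo credentials and the default `public` schema) prints a row count for each table:

```python
# Quick row-count check on the raw, staging, and mart tables after a DAG run.
# Assumes the demo credentials from docker-compose.yml and the default "public" schema.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="demo_dw", user="demo_user", password="demo_pass",
)
with conn.cursor() as cur:
    for table in ("daily_weather", "stg_weather", "weather_summary"):
        cur.execute(f"SELECT count(*) FROM {table};")  # table names come from a fixed tuple
        print(f"{table}: {cur.fetchone()[0]} rows")
conn.close()
```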
With the Airflow DAG in place, you’ve essentially built an ELT pipeline: Extract and Load via Airflow (PythonOperator fetching from API, loading to database), then Transform via dbt (executed by Airflow’s BashOperator). This separation of concerns is powerful. Airflow orchestrates and can do any Python work or API calls, while dbt focuses on set-based SQL transformations in the warehouse.
Final Thoughts and Next Steps
Congratulations! You’ve set up a miniature real-world data engineering project. To recap, we configured a development environment on Windows using WSL for a Linux-like experience, ran a Postgres database in a Docker container, managed isolated Python environments for our tools, created a dbt project with logical model layers, and automated the pipeline with Airflow. This is a lot of moving pieces – if it’s not all running perfectly on the first try, that’s normal. Debugging and troubleshooting are part of the learning process. Through this project, you’ve touched on many common tasks a data engineer faces:
- Editing configuration files (YAML for dbt models and Docker Compose).
- Dealing with environment variables and PATH issues in a Linux environment (updating bash profiles for pyenv, etc.).
- Writing SQL transformations and organizing them for maintainability.
- Writing Python code to interface with external APIs and databases.
- Using a scheduler to orchestrate workflows and handle dependencies between tasks.
Each of these components can be deepened further. For instance, you could add error handling or data validation in the ingestion step, incorporate dbt tests more extensively (and fail the Airflow job if tests fail), or containerize Airflow itself using Docker Compose for a more production-like deployment. You could also swap the data source or add another one, and let dbt combine multiple sources in the transformation layer.
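To illustrate the dbt-tests idea, here is a minimal, hypothetical sketch of an extra task you could add to the existing DAG so that failing tests fail the Airflow run (the `test_task` name is ours; it simply reuses the BashOperator pattern from `dbt_task`):

```python
# Hypothetical extension of weather_dbt_pipeline.py: run dbt tests after dbt run.
# A non-zero exit code from `dbt test` fails the task, and therefore the DAG run.
from airflow.operators.bash import BashOperator

test_task = BashOperator(
    task_id='test_dbt_models',
    bash_command="cd ~/projects/real_world_dbt_demo/dbt/weather_demo && dbt test",
)

# Chain it after the existing tasks inside the DAG definition:
# fetch_task >> dbt_task >> test_task
```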
When you’re ready to start your own success story, we’re here to help you Land Your Dream Job – click to book an onboarding call with our team and take the first step toward your data engineering career.