Managing and combining sales data from different countries is never as simple as copying files. Every country might send you files in a different format—CSV from India, Parquet from the USA, JSON from France. Not only does this slow things down, but it can also create a mess if you try to analyze it all together. Cue the need for a smart, automated ETL pipeline.

In this post, you’ll see how to bring order to the chaos using AWS Glue, Amazon S3, Snowflake, and Snowpark. This solution grabs files from a GitHub repo, loads them into the cloud, transforms them so everything lines up just right, and sets you up for fast reporting and analytics—all with just a few clicks and scripts.

We’ve got detailed, step-by-step instructions ahead, with everything you need to build this as your own data engineering project. If you want to add a standout project to your resume, you’re in the right place.

Project Objective and Technology Stack

The main goal here: automate extracting, transforming, and loading (ETL) sales data for multiple countries into Snowflake, using AWS Glue and Snowpark.

Why go through all this effort? Companies rarely have tidy, matching files from every source. You get CSVs, JSON, Parquet, and more. Making sense of them all takes more than copy-paste. With this pipeline, you can blend everything into one clean, analytics-ready layer.

Here’s the essential toolkit you’ll use for this project: AWS Glue, Amazon S3, Snowflake, Snowpark, Python, and a GitHub repository that holds the source files.

This project will extract the country files from GitHub, stage them in S3, load them into Snowflake, and transform them into a single analytics-ready sales layer.

Here’s a snapshot of what you’re handling: CSV from India, Parquet from the USA, and JSON from France, all carrying roughly the same sales fields.

Don’t let file formats dictate your process. You want your pipeline to handle all of them without any manual work.

Country Sales Data: What Are We Dealing With?

Let’s break down the sources:

| Country | File Format | Sample Fields |
| --- | --- | --- |
| India | CSV | order ID, customer name, mobile model, quantity, unit price, total price, promo code, order date, payment/shipping details |
| USA | Parquet | Same fields, but different structure and naming |
| France | JSON | Same fields, but in JSON style |

Each file—with its own quirks—lands in a folder on a GitHub repository. What’s actually in each file? Almost always the same core sales fields shown above: order IDs, customer and product details, quantities and prices, promo codes, order dates, and payment/shipping information.

You’ll learn not only how to extract each file but also how to make them line up nicely. With this setup, anytime one of your countries sends something new, it goes through the same clean, automated process.

This isn’t just about saving time; it’s about getting to better decisions faster, with data that’s always up-to-date and ready to query.

Setting Up AWS Components for the Data Pipeline

Let’s get your cloud ducks in a row. No fancy tools required. If you’ve got AWS credentials and basic Snowflake access, you’re all set.

Step 1: Create the IAM Role for AWS Glue

Setting permissions right is often the biggest point of failure. Don’t skip this step.

Here’s how to give AWS Glue access to S3 and CloudWatch logs:

  1. Log into your AWS Management Console.
  2. In the search bar, type IAM and go to the IAM dashboard.
  3. Click Roles on the left, then choose Create role.
  4. Under Trusted entity type, pick AWS service.
  5. Select Glue as the service that will use this role.
  6. Click Next.
  7. Attach these policies:
    • AmazonS3FullAccess
    • CloudWatchFullAccess
  8. Continue, name your role something like glue_snowpark_role.
  9. Review your settings, then click Create role.

This role is what lets your Glue job move files and log errors for later debugging. It’s the foundation.

Step 2: Set Up Your S3 Bucket

S3 buckets are the digital equivalent of a big filing cabinet. You’ll store all country files here before moving them into Snowflake.

To create your data bucket:

  1. From the AWS Console, find and open S3.
  2. Click Create bucket.
  3. Enter a name like de-academy-snowpark-data-bucket.
  4. You can leave most settings as default for this project.
  5. Click Create bucket and you’re done.

Now you have a central spot ready for your incoming files.

Step 3: Build the AWS Glue Job

This job fetches your data from GitHub and puts it in your S3 bucket. You’ll use a Python script loaded right into AWS Glue Studio.

Here’s the high-level process:

  1. Go to AWS Glue in the console.
  2. Open ETL jobs and choose to create a new job.
  3. Select Script editor, use Python shell as the engine, and start fresh.
  4. Paste your Python script in (this connects to the repo, grabs files, uploads them to S3).
  5. Change the bucket name in the script to the one you just created.
  6. Assign your glue_snowpark_role.
  7. Save and run the job.

Most of the script logic is simple: download each country file from the GitHub repo, then upload it to your S3 bucket.
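Here’s a minimal sketch of what that Glue Python shell script could look like. The repository URL, file names, and bucket name are placeholders (assumptions, not the project’s actual values); swap in your own.

import urllib.request

import boto3

# Placeholders: point these at your own GitHub repo and the bucket from Step 2.
S3_BUCKET = "de-academy-snowpark-data-bucket"
GITHUB_RAW_BASE = "https://raw.githubusercontent.com/<your-account>/<your-repo>/main"
FILES = ["india_sales.csv", "usa_sales.parquet", "france_sales.json"]

s3 = boto3.client("s3")

for file_name in FILES:
    # Download the raw file from GitHub...
    data = urllib.request.urlopen(f"{GITHUB_RAW_BASE}/{file_name}").read()
    # ...and drop it into the S3 bucket that Snowflake will later read from.
    s3.put_object(Bucket=S3_BUCKET, Key=file_name, Body=data)
    print(f"Uploaded {file_name} to s3://{S3_BUCKET}/{file_name}")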

To confirm success, head over to your S3 bucket. If you see those three files (one for each country), you’re golden.

If you’re new to AWS Glue, get up to speed with their intro guides.

Configuring Snowflake IAM Role and Integration

That covers moving files into the cloud, but Snowflake needs its own permission to read from your S3 bucket. Here’s how you grant it.

Step 1: Create Snowflake’s IAM Role in AWS

  1. Log in to AWS Console, search for IAM, and click Roles.
  2. Click Create role.
  3. Select AWS Account as the trusted entity.
  4. Check the box for Require external ID and enter a placeholder like “0000”.
  5. Attach AmazonS3FullAccess.
  6. Name it something easy to remember, like snowflake_snowpark_role.
  7. Create the role.

What’s this “external ID” about? It’s an extra security step so only Snowflake (as a third party) can use this role.

Step 2: Set Up Your Snowflake Integration

Now pop over to your Snowflake account. You’ll build a database and four schemas.

Here’s an example workflow:

CREATE OR REPLACE DATABASE SNOWPARK_DB;

USE DATABASE SNOWPARK_DB;

CREATE SCHEMA STAGING;
CREATE SCHEMA RAW;
CREATE SCHEMA TRANSFORMED;
CREATE SCHEMA CURATED;

Next, create tables within your staging schema that mirror the structure of your files. For JSON and Parquet, use the VARIANT data type.
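Here’s a hedged sketch of what those staging tables could look like; the table and column names are illustrative guesses based on the fields listed earlier, not values taken from the project files.

USE SCHEMA SNOWPARK_DB.STAGING;

-- India sends CSV, so its staging table uses ordinary typed columns.
CREATE OR REPLACE TABLE INDIA_SALES_STAGING (
  ORDER_ID       STRING,
  CUSTOMER_NAME  STRING,
  MOBILE_MODEL   STRING,
  QUANTITY       NUMBER,
  UNIT_PRICE     NUMBER(10,2),
  TOTAL_PRICE    NUMBER(10,2),
  PROMO_CODE     STRING,
  ORDER_DATE     DATE
  -- plus any payment/shipping detail columns your file includes
);

-- The USA (Parquet) and France (JSON) files land in a single VARIANT column.
CREATE OR REPLACE TABLE USA_SALES_STAGING (RAW_DATA VARIANT);
CREATE OR REPLACE TABLE FRANCE_SALES_STAGING (RAW_DATA VARIANT);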

Once your schemas and tables are set, it’s time to hook up Snowflake to S3 so it can pull the files.

Storage Integration

  1. Get your AWS snowflake_snowpark_role ARN from AWS Console.
  2. Get your S3 bucket URI.
  3. In Snowflake, create a storage integration:
CREATE OR REPLACE STORAGE INTEGRATION SNOWPARK_INT
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = '<your-snowflake-snowpark-role-arn>'
  STORAGE_ALLOWED_LOCATIONS = ('s3://your-bucket-name/');
  4. Run DESC INTEGRATION SNOWPARK_INT; to get the Snowflake IAM user ARN and the generated external ID.
  5. Go back to AWS and update the trust policy for your IAM role:
    • Replace the placeholder external ID with the one from Snowflake.
    • Add the Snowflake IAM user ARN to the policy.

Mutual trust is mandatory; Snowflake won’t be able to read your S3 files without it.
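For reference, here’s a hedged example of what the updated trust policy on snowflake_snowpark_role ends up looking like. The two placeholder values come from the DESC INTEGRATION output; this is a sketch of the standard pattern, not copied from the project.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<STORAGE_AWS_IAM_USER_ARN from DESC INTEGRATION>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<STORAGE_AWS_EXTERNAL_ID from DESC INTEGRATION>"
        }
      }
    }
  ]
}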

Create the External Stage

CREATE OR REPLACE STAGE SNOWPARK_STAGE
  URL='s3://your-bucket-name/'
  STORAGE_INTEGRATION=SNOWPARK_INT;

Test your connection. List the files you uploaded:

LIST @SNOWPARK_STAGE;

If you see your three country files appear, your integration is working.

Write and Run Your Snowpark Data Pipelines

This is where the real power of Snowpark kicks in. You’ll write Python code that lives inside Snowflake, so your transformations scale as needed.

Loading Data from S3 into Staging and Raw Tables

Your first Snowpark script reads each staged file from the external stage and lands it in the staging and raw tables. To confirm it worked, check the raw schema for newly filled tables.
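Here’s a minimal sketch of that load step, written as a Snowflake Python worksheet. The stage location, file names, and target table names are assumptions; adjust them to match what your Glue job actually uploaded. For brevity, this sketch writes straight into the raw schema; in the full project you’d land the data in the staging tables first.

import snowflake.snowpark as snowpark

def main(session: snowpark.Session):
    stage = "@SNOWPARK_DB.STAGING.SNOWPARK_STAGE"  # assumes the stage lives in STAGING

    # Parquet and JSON can be read without declaring a schema up front.
    usa_df = session.read.parquet(f"{stage}/usa_sales.parquet")
    usa_df.write.save_as_table("SNOWPARK_DB.RAW.USA_SALES_RAW", mode="overwrite")

    france_df = session.read.json(f"{stage}/france_sales.json")
    france_df.write.save_as_table("SNOWPARK_DB.RAW.FRANCE_SALES_RAW", mode="overwrite")

    # CSV needs a schema; recent Snowpark versions can infer one for you.
    india_df = session.read.option("INFER_SCHEMA", True).csv(f"{stage}/india_sales.csv")
    india_df.write.save_as_table("SNOWPARK_DB.RAW.INDIA_SALES_RAW", mode="overwrite")

    return usa_df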

Transform Data for Analytics

Time to standardize everything. In the Python Snowpark script, each country’s raw table gets mapped onto the same set of column names and data types, so the three sources finally line up.

Once transformations are done, you union all three datasets into a new table in the transformed schema called global_sales_order. This is your unified, clean master sales table.
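Here’s a hedged sketch of that standardize-and-union step. It assumes each raw table already exposes the shared columns below; in the real project you’d first handle each source’s own naming and pull the France fields out of their VARIANT column.

import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col, lit

# Assumed shared column set, based on the fields listed earlier.
COMMON_COLUMNS = ["ORDER_ID", "CUSTOMER_NAME", "MOBILE_MODEL", "QUANTITY",
                  "UNIT_PRICE", "TOTAL_PRICE", "PROMO_CODE", "ORDER_DATE"]

def standardize(df, country):
    # Project onto the shared columns and tag each row with its country.
    return df.select(*[col(c) for c in COMMON_COLUMNS]).with_column("COUNTRY", lit(country))

def main(session: snowpark.Session):
    india = standardize(session.table("SNOWPARK_DB.RAW.INDIA_SALES_RAW"), "India")
    usa = standardize(session.table("SNOWPARK_DB.RAW.USA_SALES_RAW"), "USA")
    france = standardize(session.table("SNOWPARK_DB.RAW.FRANCE_SALES_RAW"), "France")

    # Stack the three standardized sources into one master table.
    global_sales = india.union_all_by_name(usa).union_all_by_name(france)
    global_sales.write.save_as_table("SNOWPARK_DB.TRANSFORMED.GLOBAL_SALES_ORDER",
                                     mode="overwrite")
    return global_sales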

Build Your Curated Data Tables for Reporting

Now, create three new summary tables with fresh Python Snowpark notebooks, each aggregating global_sales_order for a different reporting view.

Within each script, you’ll use Snowpark’s DataFrame API for filtering, grouping, and aggregation.

Examples:
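As one hedged illustration (the summary name and columns here are assumptions, since the exact three rollups aren’t spelled out above), a per-country revenue summary might look like this:

import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col, count, sum as sum_

def main(session: snowpark.Session):
    sales = session.table("SNOWPARK_DB.TRANSFORMED.GLOBAL_SALES_ORDER")

    # Group the unified sales table by country and aggregate revenue and volume.
    sales_by_country = (
        sales.group_by("COUNTRY")
             .agg(sum_(col("TOTAL_PRICE")).alias("TOTAL_REVENUE"),
                  count(col("ORDER_ID")).alias("ORDER_COUNT"))
    )

    # Persist the summary as a curated, reporting-ready table.
    sales_by_country.write.save_as_table("SNOWPARK_DB.CURATED.SALES_BY_COUNTRY",
                                         mode="overwrite")
    return sales_by_country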

After writing each DataFrame, save it as a new table in the curated schema.

Check your curated schema for the three new tables. Preview them right in Snowflake—now, you have analytics-ready datasets.

What Have You Accomplished?

This project covered the whole chain, from file arrival to business-ready reports. Here’s what you now know how to do:

  • Pull multi-format files (CSV, Parquet, JSON) from a GitHub repo with an AWS Glue Python shell job and land them in S3.
  • Grant Snowflake secure access to S3 with an IAM role, an external ID, and a storage integration.
  • Load staged files into Snowflake and standardize them with Snowpark into one global sales table.
  • Build curated summary tables that are ready for reporting and analytics.

By building this, you’re not just learning tools—you’re creating a proven, resume-ready data engineering project. You’ll be ready to add this to your portfolio, showing off real-world skills that hiring managers want.

Here’s what you can highlight: end-to-end ETL automation, multi-format ingestion, AWS-to-Snowflake integration, and Snowpark transformations written in Python.

Extra Resources to Keep Learning

Level up your skills with the official AWS Glue documentation and Snowflake’s Snowpark developer guides.

And if you’re ready to take your data engineering career up a notch, check out the Data Engineer Academy’s resources for project ideas and learning paths.


Get your hands dirty with this guide, and you’ll be ready to automate, clean, and analyze sales data no matter where—or in what shape—it arrives.
