Managing and combining sales data from different countries is never as simple as copying files. Every country might send you files in a different format—CSV from India, Parquet from the USA, JSON from France. Not only does this slow things down, but it can also create a mess if you try to analyze it all together. Cue the need for a smart, automated ETL pipeline.
In this post, you’ll see how to bring order to the chaos using AWS Glue, Amazon S3, Snowflake, and Snowpark. This solution grabs files from a GitHub repo, loads them into the cloud, transforms them so everything lines up just right, and sets you up for fast reporting and analytics—all with just a few clicks and scripts.
We’ve got detailed, step-by-step instructions ahead, with everything you need to build this as your own data engineering project. If you want to add a standout project to your resume, you’re in the right place.
Project Objective and Technology Stack
The main goal here: automate extracting, transforming, and loading (ETL) sales data for multiple countries into Snowflake, using AWS Glue and Snowpark.
Why go through all this effort? Companies rarely have tidy, matching files from every source. You get CSVs, JSON, Parquet, and more. Making sense of them all takes more than copy-paste. With this pipeline, you can blend everything into one clean, analytics-ready layer.
Here’s the essential toolkit you’ll use for this project:
- AWS Glue: Runs Python scripts for ETL logic and orchestration
- Amazon S3: Stores all incoming and processed data files
- Snowflake: Central data warehouse, ready for SQL queries or BI dashboards
- Snowpark: Lets you write Python transformations that run right inside Snowflake
This project will:
- Pull in three countries’ sales orders (India, USA, France), each in a different file format
- Automate transformation, unifying formats and column names
- Aggregate and prepare data for final analytics—all in the cloud, all repeatable
Here’s a snapshot of what you’re handling:
- File Formats:
  - CSV for India
  - Parquet for the USA
  - JSON for France
Don’t let file formats dictate your process. You want your pipeline to handle all of them without any manual work.
Country Sales Data: What Are We Dealing With?
Let’s break down the sources:
| Country | File Format | Sample Fields |
| --- | --- | --- |
| India | CSV | Order ID, customer name, mobile model, quantity, unit price, total price, promo code, order date, payment/shipping details |
| USA | Parquet | Same fields, but different structure and naming |
| France | JSON | Same fields, but in JSON style |
Each file—with its own quirks—lands in a folder on a GitHub repository. What’s actually in each file? Almost always:
- Order number
- Customer details (name, sometimes contact info)
- Product purchased, model, specs
- How many units, prices, and discounts
- Payment info and shipping status
You’ll learn not only how to extract each file but also how to make them line up nicely. With this setup, anytime one of your countries sends something new, it goes through the same clean, automated process.
This isn’t just about saving time; it’s about getting to better decisions faster, with data that’s always up-to-date and ready to query.
Setting Up AWS Components for the Data Pipeline
Let’s get your cloud ducks in a row. No fancy tools required. If you’ve got AWS credentials and basic Snowflake access, you’re all set.
Step 1: Create the IAM Role for AWS Glue
Setting permissions right is often the biggest point of failure. Don’t skip this step.
Here’s how to give AWS Glue access to S3 and CloudWatch logs:
- Log into your AWS Management Console.
- In the search bar, type IAM and go to the IAM dashboard.
- Click Roles on the left, then choose Create role.
- Under Trusted entity type, pick AWS service.
- Select Glue as the service that will use this role.
- Click Next.
- Attach these policies:
  - AmazonS3FullAccess
  - CloudWatchFullAccess
- Continue, and name your role something like glue_snowpark_role.
- Review your settings, then click Create role.
This role is what lets your Glue job move files and log errors for later debugging. It’s the foundation.
Step 2: Set Up Your S3 Bucket
S3 buckets are the digital equivalent of a big filing cabinet. You’ll store all country files here before moving them into Snowflake.
To create your data bucket:
- From the AWS Console, find and open S3.
- Click Create bucket.
- Enter a name like de-academy-snowpark-data-bucket.
- You can leave most settings as default for this project.
- Click Create bucket and you’re done.
Now you have a central spot ready for your incoming files.
Step 3: Build the AWS Glue Job
This job fetches your data from GitHub and puts it in your S3 bucket. You’ll use a Python script loaded right into AWS Glue Studio.
Here’s the high-level process:
- Go to AWS Glue in the console.
- Open ETL jobs and choose to create a new job.
- Select Script editor, use Python shell as the engine, and start fresh.
- Paste your Python script in (this connects to the repo, grabs files, uploads them to S3).
- Change the bucket name in the script to the one you just created.
- Assign your glue_snowpark_role.
- Save and run the job.
Most of the script logic is:
- Importing required libraries like boto3 for AWS connections
- Connecting to the GitHub repo and reading files
- Pushing those files into your S3 bucket
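Here is a minimal sketch of what that Python shell script might look like. The repo URL, file names, and S3 key prefix are placeholders (they are not from the original project), so swap in your own; the flow of downloading from GitHub and uploading with boto3 is the part that matters.

```python
# Minimal sketch of the Glue Python shell job (placeholder repo URL and file names).
import urllib.request
import boto3

BUCKET = "de-academy-snowpark-data-bucket"  # the bucket you created earlier
BASE_URL = "https://raw.githubusercontent.com/<user>/<repo>/main/"  # placeholder repo
FILES = ["sales_india.csv", "sales_usa.parquet", "sales_france.json"]  # assumed file names

s3 = boto3.client("s3")

for name in FILES:
    # Download the raw bytes straight from GitHub
    with urllib.request.urlopen(BASE_URL + name) as response:
        body = response.read()
    # Drop the file as-is into the landing bucket
    s3.put_object(Bucket=BUCKET, Key=f"sales_data/{name}", Body=body)
    print(f"Uploaded {name} to s3://{BUCKET}/sales_data/{name}")
```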
To confirm success, head over to your S3 bucket. If you see those three files (one for each country), you’re golden.
If you’re new to AWS Glue, get up to speed with their intro guides.
Configuring Snowflake IAM Role and Integration
That covers moving files into the cloud, but Snowflake needs its own permission to read from your S3 bucket. Here’s how you grant it.
Step 1: Create Snowflake’s IAM Role in AWS
- Log in to AWS Console, search for IAM, and click Roles.
- Click Create role.
- Select AWS Account as the trusted entity.
- Check the box for Require external ID and enter a placeholder like “0000”.
- Attach AmazonS3FullAccess.
- Name it something easy to remember, like snowflake_snowpark_role.
- Create the role.
What’s this “external ID” about? It’s an extra security step so only Snowflake (as a third party) can use this role.
Step 2: Set Up Your Snowflake Integration
Now pop over to your Snowflake account. You’ll build a database and four schemas.
- Database: snowpark_db
- Schemas:
  - staging: Where files land first.
  - raw: Holds the untouched but structured data.
  - transformed: Houses cleaned, standardized records.
  - curated: Ready-for-reporting tables and summaries.
Here’s an example workflow:
```sql
CREATE OR REPLACE DATABASE SNOWPARK_DB;
USE DATABASE SNOWPARK_DB;
CREATE SCHEMA STAGING;
CREATE SCHEMA RAW;
CREATE SCHEMA TRANSFORMED;
CREATE SCHEMA CURATED;
```
Next, create tables within your staging schema that mirror the structure of your files. For JSON and Parquet, use the VARIANT data type.
- India (CSV): Table with VARCHAR columns for each field.
- USA (Parquet): Table with a VARIANT column for raw Parquet.
- France (JSON): Table with a VARIANT column for JSON.
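If you prefer to script this setup too, here is a rough sketch from a Python worksheet. The table and column names below are illustrative assumptions, not the exact schema of the source files, so adjust them to match your data.

```python
# Rough sketch: create staging tables via session.sql (illustrative names and columns).
import snowflake.snowpark as snowpark

def main(session: snowpark.Session):
    # India arrives as CSV, so give it explicit VARCHAR columns
    session.sql("""
        CREATE OR REPLACE TABLE STAGING.SALES_ORDER_INDIA (
            ORDER_ID VARCHAR, CUSTOMER_NAME VARCHAR, MOBILE_MODEL VARCHAR,
            QUANTITY VARCHAR, UNIT_PRICE VARCHAR, TOTAL_PRICE VARCHAR,
            PROMO_CODE VARCHAR, ORDER_DATE VARCHAR,
            PAYMENT_STATUS VARCHAR, SHIPPING_STATUS VARCHAR
        )
    """).collect()

    # Parquet and JSON land in a single semi-structured VARIANT column
    session.sql("CREATE OR REPLACE TABLE STAGING.SALES_ORDER_USA (SALES_ORDER VARIANT)").collect()
    session.sql("CREATE OR REPLACE TABLE STAGING.SALES_ORDER_FRANCE (SALES_ORDER VARIANT)").collect()
    return session.sql("SHOW TABLES IN SCHEMA STAGING")
```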
Once your schemas and tables are set, it’s time to hook up Snowflake to S3 so it can pull the files.
Storage Integration
- Get your AWS snowflake_snowpark_role ARN from the AWS Console.
- Get your S3 bucket URI.
- In Snowflake, create a storage integration:

  ```sql
  CREATE OR REPLACE STORAGE INTEGRATION SNOWPARK_INT
    TYPE = EXTERNAL_STAGE
    STORAGE_PROVIDER = 'S3'
    ENABLED = TRUE
    STORAGE_AWS_ROLE_ARN = '<your-snowflake-snowpark-role-arn>'
    STORAGE_ALLOWED_LOCATIONS = ('s3://your-bucket-name/');
  ```

- Run DESC INTEGRATION SNOWPARK_INT; to get the Snowflake IAM user ARN and the generated external ID.
- Go back to AWS and update the trust policy for your IAM role:
  - Replace the placeholder external ID with the one from Snowflake.
  - Add the Snowflake IAM user ARN to the policy.
Mutual trust is mandatory; Snowflake won’t be able to read your S3 files without it.
Create the External Stage
```sql
CREATE OR REPLACE STAGE SNOWPARK_STAGE
  URL = 's3://your-bucket-name/'
  STORAGE_INTEGRATION = SNOWPARK_INT;
```
Test your connection. List the files you uploaded:
LIST @SNOWPARK_STAGE;
If you see your three country files appear, your integration is working.
Write and Run Your Snowpark Data Pipelines
This is where the real power of Snowpark kicks in. You’ll write Python code that lives inside Snowflake, so your transformations scale as needed.
Loading Data from S3 into Staging and Raw Tables
- Open a Python worksheet in Snowflake.
- Establish your database and schema.
- Truncate staging tables to start fresh.
- Use COPY INTO to pull data from the S3 stage into each country's staging table.
- For Parquet/JSON tables (USA and France), flatten the VARIANT columns to turn nested stuff into clean, tabular rows.
- Write the cleaned-up records into the corresponding tables in the raw schema. Use overwrite mode for simplicity.
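As a rough sketch (with assumed table, stage, and file names), the India load inside a Python worksheet could look like this; the USA and France loads follow the same pattern with Parquet/JSON file formats plus a flatten step.

```python
# Rough sketch of the staging -> raw load for India (assumed object names).
import snowflake.snowpark as snowpark

def main(session: snowpark.Session):
    # Start clean, then pull the CSV from the external stage into staging
    session.sql("TRUNCATE TABLE STAGING.SALES_ORDER_INDIA").collect()
    session.sql("""
        COPY INTO STAGING.SALES_ORDER_INDIA
        FROM @SNOWPARK_STAGE/sales_india.csv
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """).collect()

    # Promote the staged rows into the raw layer, overwriting any previous load
    df = session.table("STAGING.SALES_ORDER_INDIA")
    df.write.mode("overwrite").save_as_table("RAW.SALES_ORDER_INDIA")
    return df
```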
Check by looking in the raw schema for newly filled tables.
Transform Data for Analytics
Time to standardize everything. Here’s what happens in the Python Snowpark script:
- Reorder columns to make future unions go smoothly.
- Rename columns like GST to Tax, and Mobile to Contact Number for clarity.
- Add a Country field so you know the origin of every row.
- Fill missing values in promo codes with "NA" to prevent null trouble.
- Split packed columns (like mobile_model, which might look like Apple-iPhone13-Blue-4GB-128GB) into separate columns for brand, model, color, RAM, and memory.
- Insert a timestamp (insert_dts) to track when data landed.
Once transformations are done, you union all three datasets into a new table in the transformed schema called global_sales_order. This is your unified, clean master sales table.
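To make this concrete, here is a rough Snowpark sketch of the India transform. The column names (GST, MOBILE, PROMO_CODE, MOBILE_MODEL) are assumptions based on the fields listed above, so adjust them to your actual raw tables.

```python
# Rough sketch of the India transform (column names are assumptions).
import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col, lit, split, current_timestamp
from snowflake.snowpark.types import StringType

def main(session: snowpark.Session):
    df = session.table("RAW.SALES_ORDER_INDIA")

    # Standardize names, tag rows with their country, and handle missing promo codes
    df = (df.with_column_renamed("GST", "TAX")
            .with_column_renamed("MOBILE", "CONTACT_NUMBER")
            .with_column("COUNTRY", lit("India"))
            .fillna({"PROMO_CODE": "NA"}))

    # Break the packed mobile_model value (e.g. Apple-iPhone13-Blue-4GB-128GB) apart
    parts = split(col("MOBILE_MODEL"), lit("-"))
    df = (df.with_column("BRAND", parts[0].cast(StringType()))
            .with_column("MODEL", parts[1].cast(StringType()))
            .with_column("COLOR", parts[2].cast(StringType()))
            .with_column("RAM", parts[3].cast(StringType()))
            .with_column("MEMORY", parts[4].cast(StringType()))
            .with_column("INSERT_DTS", current_timestamp()))

    # Repeat the same steps for the USA and France raw tables, then union and save:
    # global_df = india_df.union_all_by_name(usa_df).union_all_by_name(france_df)
    # global_df.write.mode("overwrite").save_as_table("TRANSFORMED.GLOBAL_SALES_ORDER")
    return df
```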
Build Your Curated Data Tables for Reporting
Now, create three new summaries with fresh Snowpark Python worksheets:
- Global Sales Delivered: Includes only rows where shipping_status is "Delivered".
- Global Sales by Brand and Model: Aggregates sales totals by mobile brand and model.
- Global Sales by Country, Year, and Quarter: Slices data to show sales volume and totals for each country per quarter.
Within each script, you’ll use Snowpark’s DataFrame API for filtering, grouping, and aggregation.
Examples:
- Use .filter() or .where() to pull delivered orders.
- Use .group_by() with .sum() to get total sales.
- Use date functions to create Year and Quarter columns, then sum sales and volumes accordingly.
After writing each DataFrame, save it as a new table in the curated schema.
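For instance, a hedged sketch of the country/year/quarter summary might look like the following (ORDER_DATE, TOTAL_PRICE, and QUANTITY are assumed column names); the delivered-orders table is just a .filter() on SHIPPING_STATUS, and the brand/model table swaps the grouping keys.

```python
# Rough sketch of the sales-by-country/year/quarter curated table (assumed column names).
import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col, year, quarter, sum as sum_

def main(session: snowpark.Session):
    sales = session.table("TRANSFORMED.GLOBAL_SALES_ORDER")

    # Derive Year and Quarter from the order date, then aggregate per country
    summary = (sales
               .with_column("YEAR", year(col("ORDER_DATE")))
               .with_column("QUARTER", quarter(col("ORDER_DATE")))
               .group_by("COUNTRY", "YEAR", "QUARTER")
               .agg(sum_("TOTAL_PRICE").alias("TOTAL_SALES"),
                    sum_("QUANTITY").alias("TOTAL_UNITS")))

    summary.write.mode("overwrite").save_as_table("CURATED.SALES_BY_COUNTRY_YEAR_QUARTER")
    return summary
```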
Check your curated schema for the three new tables. Preview them right in Snowflake—now, you have analytics-ready datasets.
What Have You Accomplished?
This project covered the whole chain, from file arrival to business-ready reports. Here’s what you now know how to do:
- Design a cross-format ETL pipeline with AWS Glue and Snowflake Snowpark
- Set up IAM roles and permissions to control data access securely
- Create layered Snowflake storage—staging, raw, transformed, curated
- Transform and aggregate data using Snowpark Python, right inside your data warehouse
- Link AWS and Snowflake for direct S3-stage reads, keeping everything cloud-native
By building this, you’re not just learning tools—you’re creating a proven, resume-ready data engineering project. You’ll be ready to add this to your portfolio, showing off real-world skills that hiring managers want.
Here’s what you can highlight:
- Multinational ETL automation
- Cloud data warehouse integration
- Hands-on Snowpark scripting
- IAM and security best practices
- End-to-end analytics pipeline
Extra Resources to Keep Learning
Level up your skills with these links:
- AWS Glue documentation for beginners
- Snowflake Snowpark developer guide
- Snowflake Storage Integrations docs
- AWS IAM best practices
And if you’re ready to take your data engineering career up a notch, check out the Data Engineer Academy’s resources for project ideas and learning paths.
Get your hands dirty with this guide, and you’ll be ready to automate, clean, and analyze sales data no matter where—or in what shape—it arrives.