
How to Build a RAG System Companies Actually Use (Data Engineering View)
Most people try to break into GenAI work by collecting tools like trophies. That doesn’t hold up in interviews.
A production RAG system is mostly data engineering: ingesting messy documents, keeping them updated, retrieving the right context fast, and only then calling an LLM. This post breaks down what companies actually need, using a real Slack bot style example, plus the system design choices that matter.
Key takeaways
- A RAG system answers questions using your internal docs, by retrieving relevant context before calling an LLM.
- Data engineering in AI means building ingestion, storage, and transformation for unstructured data at scale.
- Keyword search is not enough; you need embeddings and similarity search to retrieve the right chunks.
- Always use a landing zone (like S3 or ADLS) to decouple ingestion from downstream steps.
- Incremental ingestion is the real test, because your docs change every week (or every day).
From $60K to $450K: why projects beat “tool collecting”
Christopher Garzon (Data Engineer Academy) describes a career path many data engineers want: going from about $60K per year to roughly $450K per year in under five years, with stops at Amazon, Lyft, and startups.
The pattern wasn’t “learn every tool.” It was closer to this:
- Build projects that look like real work.
- Use those projects to explain business impact.
- Switch companies when the skill story is strong.
Each move came with big comp jumps (often 20 to 60 percent more). That only happens when you can explain what you built, why it mattered, and how it ran in production.
If you can’t explain the pipeline, the tradeoffs, and the “why,” the tool names won’t save you.
What does data engineering in AI mean (and what it doesn’t)?
Data engineering in AI means building the pipelines that collect, clean, store, and serve large amounts of data so AI systems can use it reliably. It’s not “using AI to do data engineering.” Instead, it’s the infrastructure behind AI features, especially when the data is unstructured (docs, PDFs, text, images).
A helpful way to separate the ideas:
- AI in data engineering: using AI to speed up DE tasks (coding help, testing, docs).
- Data engineering in AI: feeding AI systems with the right data, in the right form, on the right schedule.
Companies now ask for “GenAI experience,” but what they often mean is: can you process unstructured data, run incremental ingestion, and support retrieval systems that don’t fall apart in production?
What is a RAG system (in plain English)?
A RAG system is a question answering setup that combines an LLM with your company’s private data. Instead of expecting the model to “already know” your internal docs, RAG retrieves relevant pieces of those docs and sends them as context to the LLM. The LLM then answers using that provided context.
That “private data” can be almost anything:
- PDFs and Word docs
- Text files
- JSON and CSV exports
- Images (in multimodal setups)
- Slack messages (via API)
In the video example, a Slack bot (Support Genie) answers SQL and Python questions using Data Engineer Academy content. The key point is simple: public ChatGPT won’t know your internal portal, docs, or past support answers, so you have to supply them.
RAG vs AI agents (don’t mix them up)
An agent is a workflow that takes actions. RAG is retrieval plus generation: you ask a question, the system fetches context, and it returns an answer.
Keeping that distinction clear helps in system design interviews, because the components and failure modes differ.
How does RAG work step by step?
RAG works by converting the user’s question into a search-friendly representation, retrieving the most relevant document chunks, and then passing those chunks into the LLM as context. The model does not “learn” your docs in that moment; it simply uses retrieved context to answer. This is why retrieval quality and data freshness matter.
A practical, interview-ready flow looks like this:
- User query comes in (Slack, web app, internal tool).
- Query processing prepares the request (often includes embeddings).
- Retriever fetches the most relevant chunks from storage.
- Context builder formats those chunks into a prompt.
- LLM generates an answer using that context.
- Response returns to the user, often with logging and monitoring.
The hard part starts when your internal data is not “10 PDFs you never update” but thousands of changing files, messages, and pages.
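The steps above can be sketched end to end. This is a minimal illustration, not a specific framework’s API: word overlap stands in for real embedding similarity, and the “LLM” is a stub, so the shape of the pipeline is visible without any external services.

```python
# End-to-end sketch of the request flow above. All helpers are
# stand-ins: retrieve() would normally be an embedding search against
# a vector store, and call_llm() a real model API call.

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    # Stand-in for embedding search: score chunks by shared words.
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return scored[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

def call_llm(prompt: str) -> str:
    # Stub for the model call; a real system would hit an LLM API here.
    return "stubbed answer based on: " + prompt.splitlines()[1]

chunks = [
    "SQL window functions rank rows within a partition",
    "HR onboarding policy for new hires",
    "Python list comprehensions build lists concisely",
]
question = "how do sql window functions work"
answer = call_llm(build_prompt(question, retrieve(question, chunks, top_k=1)))
```

Every box in the flow maps to one function here, which is also how you would carve the system into testable units.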
Why keyword search fails at scale (and embeddings win)
Keyword search fails because humans and machines don’t match meaning the same way. A word like “where” could refer to SQL, a policy doc, or an HR guide. If you pull every document containing that keyword, you drown the LLM in noise. Embeddings fix this by searching for semantic similarity, not exact terms.
In older chatbots, keywords were the main approach. With modern RAG, the system usually uses embeddings, which are numeric vectors that represent meaning. Both the question and each document chunk get converted into vectors, then the system finds the closest matches.
A classic illustration is that “king” and “boy” may land near each other in vector space, while unrelated words land farther away. The same idea applies to your company docs, except at much larger scale.
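The “closeness” idea is just cosine similarity over vectors. The 3-dimensional vectors below are hand-made placeholders (real embeddings have hundreds to thousands of dimensions); the point is only that related concepts score near 1.0 and unrelated ones score much lower.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy vectors standing in for real embeddings; values are illustrative.
king = [0.9, 0.8, 0.1]
boy = [0.8, 0.7, 0.2]
invoice = [0.1, 0.2, 0.9]

print(cosine(king, boy))      # high: related concepts
print(cosine(king, invoice))  # low: unrelated concepts
```

Retrieval is then “find the K stored vectors with the highest cosine score against the query vector,” which vector databases do with approximate nearest-neighbor indexes.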
Why vector databases matter for unstructured RAG data
Vector databases store embeddings and support fast similarity search. They can also store metadata (topic, link, timestamp, doc path) so you can filter and debug retrieval. While platforms like Snowflake and Databricks can store vectors, purpose-built vector stores often fit RAG workloads better, especially on cost and retrieval behavior.
In the demo, Pinecone is used as the vector database. The stored record includes:
- An ID
- The chunk text
- A numeric vector embedding
- Metadata such as link, topic, subject, modified date, and path
That metadata becomes important when users paste a link, ask without a link, or when you need to trace why the system retrieved the wrong chunk.
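A record like the one described can be sketched as a plain dict. Every ID, link, and value here is a made-up placeholder, and the embedding is truncated to three numbers for readability.

```python
# Shape of a single vector-store record, following the fields listed
# above. All values are placeholders; real embeddings have hundreds
# to thousands of dimensions.
record = {
    "id": "sql-q42-chunk-0",
    "values": [0.12, -0.08, 0.33],  # embedding vector (truncated)
    "metadata": {
        "text": "Q: Find the second-highest salary per department ...",
        "link": "https://portal.example.com/sql/q42",
        "topic": "sql",
        "subject": "window-functions",
        "modified": "2024-05-01",
        "path": "drive/sql/interview/q42.md",
    },
}
```

With Pinecone’s Python client, a batch of these would be written with an upsert call against an index; the dict form is shown so the structure is clear without a live index.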
The data engineering pipeline behind RAG (the part most tutorials skip)
Most RAG demos start after the data magically exists in a vector database. Production systems start earlier.
If you frame this as a data engineering problem, it becomes familiar:
- Extract from a source (Drive, Slack API, internal CMS)
- Load into a landing zone
- Transform (cleaning, chunking, embeddings)
- Load into your vector database
- Repeat incrementally
Start with system design basics: source, compute, landing zone
You need to answer three questions early:
- Where is the data now? (Google Drive, Slack, S3, Confluence, internal DB)
- How will you extract it? (API, connector, scheduled job)
- Where will you land it first? (S3, ADLS Gen2, similar object storage)
The landing zone matters because it decouples systems. If your job fails while writing to the vector database, you still have the raw copy and can replay the pipeline.
It also matters because object storage handles basically any format. Many warehouses “support unstructured data,” but object storage is still the safest universal staging area.
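A landing-zone write can be as small as this sketch. The bucket layout and helper names are illustrative choices, not a standard; the upload uses boto3’s real `upload_file` call, but any object store (ADLS, GCS) works the same way.

```python
import posixpath

def landing_key(source: str, filename: str, run_date: str) -> str:
    # Deterministic layout: raw files grouped by source system and
    # ingestion date, so a failed downstream step is easy to replay.
    return posixpath.join("landing", source, run_date, filename)

def land_raw_file(local_path: str, bucket: str, source: str, run_date: str) -> str:
    # Raw copy only: no parsing or chunking happens at this stage.
    import boto3  # AWS SDK; assumes credentials are configured
    key = landing_key(source, posixpath.basename(local_path), run_date)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key
```

Keeping the key-building logic pure (no network calls) also makes the layout trivially unit-testable.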
Picking compute for ingestion: Python has to run somewhere
In practice, ingestion is “run Python somewhere reliable.” Local scripts are fine for tests, but not for production.
The video walks through common options and their tradeoffs. Here’s a compact comparison based on what was discussed.
| Option | What it’s good for | Where it struggles |
|---|---|---|
| AWS Lambda | Small, quick tasks | 15-minute runtime cap and tight memory limits, not great for TB-scale ingestion |
| AWS Glue | Spark-based ingestion, serverless feel | Library installation limits (size cap), weaker ML tooling and notebook workflow for heavy GenAI transforms |
| Databricks | Notebooks, scheduling, dependency management, ML-friendly workflows | Can cost more, requires platform setup |
| Airflow (with ECS/EC2/etc.) | Orchestration flexibility | Extra ops overhead if your team doesn’t already run it |
A key point from the walkthrough: don’t “invent” tools just for one project. If your company already runs Databricks, it’s often smarter to stay there.
Incremental ingestion: the difference between a demo and a real system
Incremental ingestion means your pipeline keeps updating the vector store as new or changed docs arrive. That’s the real job. Most tutorials show a one-time load of static files, but companies add docs weekly, daily, sometimes hourly.
In the Data Engineer Academy example, new SQL questions and tutorials keep getting added. A support staff member uploads a new document to Drive. Then a scheduled job runs weekly to ingest and process the new content.
A practical incremental flow looks like this:
- New doc arrives in Drive (or Slack message arrives via API).
- Scheduled job runs (weekly in the example).
- Extract new and changed items.
- Copy to S3 as the landing zone.
- Chunk the text.
- Create embeddings (OpenAI is used in the demo).
- Upsert vectors and metadata into Pinecone.
That is data engineering in AI, because the AI feature depends on your ability to keep the data current, searchable, and traceable.
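The incremental part of the flow above often reduces to a watermark check: only reprocess what changed since the last run. Doc records and the watermark are simplified to in-memory values here; in production the watermark would live in a state store and the timestamps would come from the source system.

```python
from datetime import datetime

def docs_to_ingest(docs: list[dict], last_run: datetime) -> list[dict]:
    # A doc is (re)chunked, (re)embedded, and upserted only if it was
    # modified after the last successful run.
    return [d for d in docs if d["modified"] > last_run]

docs = [
    {"id": "a", "modified": datetime(2024, 5, 1)},
    {"id": "b", "modified": datetime(2024, 5, 8)},
    {"id": "c", "modified": datetime(2024, 5, 9)},
]
new = docs_to_ingest(docs, last_run=datetime(2024, 5, 7))
# "b" and "c" flow through the pipeline; "a" is untouched,
# which is what keeps weekly runs cheap.
```

Upserting by a stable ID (as Pinecone and most vector stores support) is what makes this idempotent: rerunning the job overwrites the same records instead of duplicating them.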
Chunking strategy: fixed, recursive, and why “semantic” can get expensive
Chunking is splitting a document into smaller pieces so retrieval stays accurate and the LLM context stays within limits. In the walkthrough, chunk size isn’t chosen randomly. It’s based on the content shape.
For example, many SQL interview questions (prompt, tables, solution, explanation) can fit into a single chunk. On the other hand, long tutorials with video transcripts cannot.
The discussed approaches:
- Fixed-size chunking: you choose a chunk size and overlap. It’s simple and common in production.
- Recursive chunking: split based on document structure and file type (code, JSON, text), often improving boundaries.
- Semantic chunking: uses an LLM to understand meaning while chunking. It can get expensive quickly because you send more content to models during preprocessing.
In the example system, chunk size is set around 500 to 700 tokens, which works out to roughly a few hundred words depending on the tokenizer. The important part is the method: analyze your docs, then choose a chunk strategy that matches your retrieval goals.
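Fixed-size chunking with overlap, the simplest of the strategies above, is only a few lines. The 600/50 defaults are illustrative choices echoing the 500 to 700 token range, and `tokens` can be any pre-tokenized list (words, subword tokens, or lines).

```python
def chunk_fixed(tokens: list[str], size: int = 600, overlap: int = 50) -> list[list[str]]:
    # Each chunk shares `overlap` tokens with the previous one, so an
    # answer that spans a chunk boundary is still retrievable.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```

Recursive chunking replaces the fixed stride with splits at structural boundaries (headings, code fences, JSON objects), which usually costs nothing extra; semantic chunking is where LLM calls enter the preprocessing bill.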
A production RAG demo: Slack question to Pinecone retrieval to LLM answer
The live walkthrough shows what “good retrieval” looks like.
A user asks a Slack bot to elaborate on a SQL solution from the Data Engineer Academy portal. The system:
- Logs the question for observability
- Sends the query to Pinecone as an embedding search
- Retrieves the top matches (often 3 to 5 chunks)
- Builds a prompt that includes the retrieved question and solution context
- Sends that prompt to the LLM
- Returns the final answer in Slack
The “top K” choice (3 to 5) is practical. Sometimes the best match is not ranked first, which is where re-ranking can come in later.
What’s also useful is the data model inside the vector store. Along with vectors, the system stores IDs and metadata like paths and topics, so you can trace results back to source content.
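The traceability point can be made concrete: if the context builder carries each chunk’s metadata link into the prompt, every answer can be traced back to its source. The match shape below mimics a vector-store response but is a made-up placeholder.

```python
def build_context(matches: list[dict]) -> str:
    # Prepend the source link to each retrieved chunk so the final
    # answer (and any debugging session) can point back to the doc.
    blocks = [
        f"[source: {m['metadata']['link']}]\n{m['metadata']['text']}"
        for m in matches
    ]
    return "\n\n".join(blocks)

matches = [
    {"metadata": {"link": "https://portal.example.com/sql/q42",
                  "text": "Use DENSE_RANK() over a window ..."}},
]
print(build_context(matches))
```

When retrieval goes wrong, this is the difference between “the bot hallucinated” and “chunk q42 was retrieved for the wrong topic, fix its metadata.”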
Why Databricks beat Glue and SageMaker in this setup
Databricks gets picked here for reasons that match day-to-day engineering, not hype.
Databricks vs AWS Glue
Glue can run Spark jobs, but installing lots of libraries can be painful. The walkthrough mentions a library size cap (around 250 MB), which becomes a real blocker when your GenAI preprocessing uses many dependencies.
Databricks also helps with:
- Notebook-style development
- Scheduling workflows and dependency management
- Better built-in observability for jobs
- More ML-friendly features (including embedding model options)
Databricks vs SageMaker and Bedrock
SageMaker focuses on the ML side. It’s not as strong for routine data engineering needs like incremental ingestion, table management, and orchestration.
Bedrock can be great for no-code setups, but you lose control over the ingestion and transformation details. Costs can also climb fast if you don’t control how much data you send to models.
FAQ
1) What is retrieval augmented generation (RAG)?
Retrieval augmented generation (RAG) is a pattern where an LLM answers questions using external context retrieved from your own data. Instead of relying only on what the model learned during training, RAG searches your internal documents, selects the most relevant chunks, and passes them into the prompt.
2) What does “data engineering in AI” mean for job interviews?
Data engineering in AI means you can build pipelines that prepare and serve data for AI features. In interviews, that usually means unstructured ingestion, chunking, embeddings, vector database writes, and incremental updates. Hiring teams want proof you can run this reliably, not just build a one-time demo.
3) How do vector databases help RAG systems?
Vector databases help RAG systems by storing embeddings and supporting similarity search, which finds relevant content by meaning instead of keywords. They also store metadata like source, topic, and timestamps. That mix makes retrieval faster, improves answer quality, and makes debugging possible when results look wrong.
4) Which chunking strategy should you use first for RAG?
Start with fixed-size or recursive chunking because both are practical, predictable, and easy to tune. Semantic chunking can work, but it often raises preprocessing cost since it involves more LLM calls. The best choice depends on your document types, target chunk size, and how often you reprocess content.
5) Is Data Engineer Academy a good fit if I want to build AI systems?
Data Engineer Academy can be a fit if you want hands-on projects that mirror production pipelines, including ingestion, transformation, and system design. The key is building work you can explain in interviews.
Conclusion
A RAG system that companies actually use isn’t just an LLM with a fancy UI. It’s a data engineering pipeline that keeps unstructured data fresh, searchable, and traceable, then retrieves the right context at the right time.
If you take one thing from this breakdown, make it this: focus on incremental ingestion and retrieval quality, because that’s where real systems succeed or fail. Build that as a project, then use it as your interview story.

