When developers build their first RAG system, the pipeline usually looks simple: load documents → split into chunks → create embeddings → store in a vector database.
This works well in demos. But once real documents enter the system (financial reports, research papers, contracts, PDFs with tables and images), the answers become unreliable.
Most of the time the issue is not the LLM. The problem is how the documents are ingested.
A Typical RAG Ingestion Pipeline
Document
↓
Document Normalization
↓
Layout Extraction
↓
Chunking
↓
Embeddings
↓
Vector Database
Each stage plays a role in making the information easier to retrieve.
1. Document Normalization
Documents arrive in many formats: PDF, DOCX, HTML, scanned images, PowerPoint slides. Without normalization, you might end up with broken text, duplicate headers on every page, or encoding issues that corrupt embeddings.
Normalization converts messy real-world documents into clean, consistent text that chunking strategies can work with reliably.
Common normalization tasks
Remove repeated headers and footers
Many PDFs repeat "Page 5 of 20" or company logos on every page.
These add noise to chunks and waste embedding space.
Fix encoding issues
Some PDFs use custom fonts or encodings that break text extraction.
Normalization detects and fixes these before chunking.
Merge hyphenated words
PDFs often split words across lines: "docu-\nment" becomes "document".
Preserve document structure
Keep section boundaries, page numbers as metadata, not inline text.
Example with PyMuPDF
import fitz  # PyMuPDF
import re

def normalize_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        page_text = page.get_text()
        # Remove "Page X of Y" page markers
        page_text = re.sub(r'Page \d+ of \d+', '', page_text)
        # Merge words hyphenated across line breaks
        page_text = re.sub(r'(\w+)-\n(\w+)', r'\1\2', page_text)
        text += page_text
    return text.strip()
2. Fixed-Length Chunking
The simplest strategy is splitting text by size. For example, every 500 tokens becomes one chunk.
This is useful for quick prototypes, but it has a major problem. It ignores document structure completely.
A chunk might start in the middle of a paragraph and end halfway through a table. This breaks context and makes retrieval less accurate.
When to use it
Fixed-length chunking works fine for simple documents like plain text articles or blog posts. But for complex documents with tables, lists, or sections, it often fails.
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,    # measured in characters by default, not tokens
    chunk_overlap=50,  # overlap helps preserve context at boundaries
)
chunks = splitter.split_text(document_text)
3. Recursive Chunking
Recursive chunking is smarter than fixed length chunking. It tries to respect document structure by splitting at natural boundaries.
Here's how it works:
First, it tries to split by paragraphs (double newlines).
If a paragraph is still too large, it splits by sentences (periods).
If a sentence is still too large, it splits by words.
This way, chunks are more likely to contain complete thoughts rather than cutting mid-sentence.
Why it's better
When you retrieve a chunk, it contains a full idea instead of broken fragments. This helps the LLM generate better answers.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""]  # try these in order
)
chunks = splitter.split_text(document_text)
4. Semantic Chunking
Semantic chunking is the most intelligent approach. Instead of splitting by size or structure, it splits by meaning.
Here's the idea:
The system converts each sentence into an embedding (a vector representation).
It then measures similarity between consecutive sentences.
When similarity drops significantly, it means the topic has changed.
That's where a new chunk begins.
Why this matters
Imagine a document that discusses "revenue" in one paragraph and then switches to "employee benefits" in the next.
Semantic chunking detects this topic shift and creates a boundary. This keeps related information together, even if paragraphs are short or long.
The tradeoff
Semantic chunking is slower because it requires embedding every sentence. But for complex documents, the improved retrieval quality is often worth it.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [s.strip() for s in document_text.split(".") if s.strip()]
embeddings = model.encode(sentences, normalize_embeddings=True)
# Cosine similarity between consecutive sentences (vectors are unit-normalized)
similarities = np.sum(embeddings[:-1] * embeddings[1:], axis=1)
# Start a new chunk wherever similarity drops below a threshold
chunks, current = [], [sentences[0]]
for sentence, sim in zip(sentences[1:], similarities):
    if sim < 0.5:
        chunks.append(". ".join(current))
        current = [sentence]
    else:
        current.append(sentence)
chunks.append(". ".join(current))
5. Parent–Child (Hierarchical) Chunking
This is one of the most effective strategies for production RAG systems. It solves a common problem: small chunks lose context.
The problem with small chunks
Small chunks (200-300 tokens) are great for precise retrieval. But they often lack surrounding context.
For example, a chunk might say "The policy was updated in Q3." But which policy? Without context, the LLM cannot answer properly.
How parent-child chunking works
First, split the document into large parent sections (1000-1500 tokens).
Then split each parent into smaller child chunks (200-300 tokens).
During retrieval:
The system searches using the small child chunks (precise matching).
But it returns the entire parent section to the LLM (full context).
Why this works
You get the best of both worlds. Precise retrieval from small chunks, but the LLM sees the full context.
Parent Section (1500 tokens)
├ Child chunk 1 (300 tokens)
├ Child chunk 2 (300 tokens)
├ Child chunk 3 (300 tokens)
└ Child chunk 4 (300 tokens)
# Search using child chunks
# Return parent section to LLM
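As a minimal sketch of the idea (using naive character-based splitting for illustration, with `document_text` standing in for an already-normalized document):

```python
def split_fixed(text, size):
    """Naive fixed-size splitter, used here only for illustration."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_parent_child_index(text, parent_size=1500, child_size=300):
    """Return parent sections plus (child_text, parent_index) pairs."""
    parents = split_fixed(text, parent_size)
    children = []
    for p_idx, parent in enumerate(parents):
        for child in split_fixed(parent, child_size):
            children.append((child, p_idx))
    return parents, children

document_text = " ".join(f"Sentence number {i}." for i in range(200))  # toy document
parents, children = build_parent_child_index(document_text)
# Embed and search over the child texts; when child i matches,
# hand parents[children[i][1]] to the LLM for full context.
```

In production you would index the child texts in the vector store and keep the parent index (or a parent ID) in each child's metadata, so the lookup survives serialization.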
6. Layout-Aware Parsing
Many PDFs are not just text. They contain tables, images, charts, multi-column layouts, and text boxes.
If you extract these as plain text, the structure gets destroyed. A table becomes a mess of words with no clear meaning.
What layout-aware parsing does
It detects different elements in the document:
• Text blocks
• Tables
• Images
• Headers and footers
• Multi-column sections
Each element is extracted separately and processed differently. Tables stay as tables. Images are described. Text flows naturally.
Tools for layout parsing
Unstructured - Open source, works well for most PDFs
Azure Document Intelligence - Cloud service, very accurate
Amazon Textract - AWS service, good for forms and tables
PyMuPDF - Fast and lightweight, basic layout detection
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("report.pdf")
for e in elements:
    if e.category == "Table":
        ...  # handle the table separately
    elif e.category == "Image":
        ...  # extract and describe the image
    else:
        ...  # regular text goes through normal chunking
7. Handling Tables
Tables are one of the hardest parts of document ingestion. If you convert a table to plain text, it becomes unreadable.
The problem with flattening tables
Imagine this table:
| Year | Revenue | Profit |
|------|---------|--------|
| 2023 | 5M | 1.2M |
| 2024 | 7M | 2.1M |
If you flatten it to text, you get: "Year Revenue Profit 2023 5M 1.2M 2024 7M 2.1M"
The LLM cannot understand which number belongs to which column.
Better approach: Structured text
Convert tables into structured text that preserves relationships:
Year: 2023
Revenue: 5M
Profit: 1.2M
Year: 2024
Revenue: 7M
Profit: 2.1M
Now the LLM can clearly see that 5M revenue belongs to 2023.
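A small helper for this conversion (the function name and row format here are illustrative, not from any particular library) might look like:

```python
def table_to_structured_text(header, rows):
    """Render each table row as 'Column: value' lines, one record per row."""
    records = []
    for row in rows:
        records.append("\n".join(f"{col}: {val}" for col, val in zip(header, row)))
    return "\n\n".join(records)

chunk = table_to_structured_text(
    ["Year", "Revenue", "Profit"],
    [["2023", "5M", "1.2M"], ["2024", "7M", "2.1M"]],
)
```

Each record then becomes (or joins) a chunk, so a query like "2024 profit" retrieves text where the year and the number sit side by side.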
Tools for table extraction
LlamaParse - Specifically built for complex tables
Azure Document Intelligence - Excellent table detection
Amazon Textract - Good for forms and structured tables
Unstructured - Works for simple tables
PyMuPDF - Fast but less accurate
8. Handling Images
Images cannot be embedded directly with text embeddings. They must first be converted into text descriptions or vision embeddings.
Approach 1: Vision Language Models (Recommended)
Modern vision-language models like GPT-4o, Claude, Qwen or Gemini can generate detailed descriptions of images including charts, diagrams, and screenshots.
from openai import OpenAI
import base64

client = OpenAI()

with open("chart.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail for a RAG system."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
        ],
    }],
)
image_description = response.choices[0].message.content
Approach 2: OCR for Text-Heavy Images
For scanned documents or images with text, OCR extracts the text directly.
import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open("scan.png"))
Approach 3: Multimodal Embeddings
Some embedding models support both text and images in the same vector space. This allows direct image-to-text retrieval.
from sentence_transformers import SentenceTransformer
from PIL import Image

model = SentenceTransformer("clip-ViT-B-32")
# CLIP maps images and text into the same vector space,
# so these two embeddings can be compared directly
image_embedding = model.encode(Image.open("diagram.png"))
text_embedding = model.encode("technical architecture diagram")
9. Metadata Enriched Ingestion
Metadata is extra information about each chunk that helps during retrieval. It acts like filters or tags.
Why metadata matters
Imagine you have 1000 documents in your vector database. A user asks: "What was the revenue policy in 2024?"
Without metadata, the system searches through all 1000 documents. With metadata, it can filter to only documents from 2024 in the "finance" category.
This makes retrieval faster and more accurate.
Common metadata fields
source - Which document this chunk came from
date - When the document was created
section - Which section of the document
page_number - For reference
document_type - Policy, report, email, etc.
{
  "text": "The revenue policy was updated...",
  "metadata": {
    "source": "finance_policy_2024.pdf",
    "year": 2024,
    "section": "taxation",
    "page": 12,
    "document_type": "policy"
  }
}
During retrieval, you can filter: "Find chunks where year=2024 AND section='taxation'"
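Most vector databases expose this as a `where`-style filter on the query. The underlying idea, sketched in plain Python over the chunk format above (the helper and sample chunks are illustrative):

```python
chunks = [
    {"text": "The revenue policy was updated...",
     "metadata": {"source": "finance_policy_2024.pdf", "year": 2024,
                  "section": "taxation", "page": 12, "document_type": "policy"}},
    {"text": "Employee benefits were revised...",
     "metadata": {"source": "hr_policy_2023.pdf", "year": 2023,
                  "section": "benefits", "page": 4, "document_type": "policy"}},
]

def filter_chunks(chunks, **conditions):
    """Keep only chunks whose metadata matches every condition."""
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in conditions.items())]

results = filter_chunks(chunks, year=2024, section="taxation")
```

The vector search then runs only over the surviving chunks, which is both faster and less likely to surface a plausible-looking match from the wrong year.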
Final Thoughts
Many discussions about RAG focus on models and embeddings. But in practice, the biggest improvements often come from better document ingestion.
If document structure is broken during ingestion, even the best LLM cannot recover the lost context.
A well-designed ingestion pipeline usually leads to better retrieval and more reliable answers.