When developers build their first RAG system, the pipeline usually looks simple: load documents → split into chunks → create embeddings → store in a vector database.
This works well in demos. But once real documents enter the system (financial reports, research papers, contracts, PDFs with tables and images), the answers become unreliable.
Most of the time the issue is not the LLM. The problem is how the documents are ingested.
A Typical RAG Ingestion Pipeline
Document
↓
Document Normalization
↓
Layout Extraction
↓
Chunking
↓
Embeddings
↓
Vector Database
Each stage plays a role in making the information easier to retrieve.
1. Document Normalization
Documents arrive in many formats: PDF, DOCX, HTML, scanned images, PowerPoint slides. Without normalization, you might end up with broken text, duplicate headers on every page, or encoding issues that corrupt embeddings.
Normalization converts messy real-world documents into clean, consistent text that chunking strategies can work with reliably.
Common normalization tasks
Remove repeated headers and footers
Many PDFs repeat "Page 5 of 20" or company logos on every page.
These add noise to chunks and waste embedding space.
Fix encoding issues
Some PDFs use custom fonts or encodings that break text extraction.
Normalization detects and fixes these before chunking.
Merge hyphenated words
PDFs often split words across lines: "docu-\nment" becomes "document".
Preserve document structure
Keep section boundaries, page numbers as metadata, not inline text.
Example with PyMuPDF
import fitz  # PyMuPDF
import re

def normalize_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        page_text = page.get_text()
        # Remove "Page X of Y" page markers
        page_text = re.sub(r'Page \d+ of \d+', '', page_text)
        # Merge words hyphenated across line breaks
        page_text = re.sub(r'(\w+)-\n(\w+)', r'\1\2', page_text)
        text += page_text
    return text.strip()
2. Fixed-Length Chunking
The simplest strategy is splitting text by size. For example, every 500 tokens becomes one chunk.
This is useful for quick prototypes, but it has a major problem. It ignores document structure completely.
A chunk might start in the middle of a paragraph and end halfway through a table. This breaks context and makes retrieval less accurate.
When to use it
Fixed-length chunking works fine for simple documents like plain text articles or blog posts. But for complex documents with tables, lists, or sections, it often fails.
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,    # measured in characters by default, not tokens
    chunk_overlap=50,  # overlap helps preserve context at boundaries
)
chunks = splitter.split_text(document_text)
3. Recursive Chunking
Recursive chunking is smarter than fixed length chunking. It tries to respect document structure by splitting at natural boundaries.
Here's how it works:
First, it tries to split by paragraphs (double newlines).
If a paragraph is still too large, it splits by sentences (periods).
If a sentence is still too large, it splits by words.
This way, chunks are more likely to contain complete thoughts rather than cutting mid-sentence.
Why it's better
When you retrieve a chunk, it contains a full idea instead of broken fragments. This helps the LLM generate better answers.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""]  # try these in order
)
chunks = splitter.split_text(document_text)
4. Semantic Chunking
Semantic chunking is the most intelligent approach. Instead of splitting by size or structure, it splits by meaning.
Here's the idea:
The system converts each sentence into an embedding (a vector representation).
It then measures similarity between consecutive sentences.
When similarity drops significantly, it means the topic has changed.
That's where a new chunk begins.
Why this matters
Imagine a document that discusses "revenue" in one paragraph and then switches to "employee benefits" in the next.
Semantic chunking detects this topic shift and creates a boundary. This keeps related information together, even if paragraphs are short or long.
The tradeoff
Semantic chunking is slower because it requires embedding every sentence. But for complex documents, the improved retrieval quality is often worth it.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [s.strip() for s in document_text.split(".") if s.strip()]
embeddings = model.encode(sentences, normalize_embeddings=True)
# Cosine similarity between consecutive sentences (vectors are unit-normalized)
similarities = np.sum(embeddings[:-1] * embeddings[1:], axis=1)
# Start a new chunk wherever similarity drops below a threshold
chunks, current = [], [sentences[0]]
for sentence, sim in zip(sentences[1:], similarities):
    if sim < 0.5:
        chunks.append(". ".join(current))
        current = [sentence]
    else:
        current.append(sentence)
chunks.append(". ".join(current))
5. Parent–Child (Hierarchical) Chunking
This is one of the most effective strategies for production RAG systems. It solves a common problem: small chunks lose context.
The problem with small chunks
Small chunks (200-300 tokens) are great for precise retrieval. But they often lack surrounding context.
For example, a chunk might say "The policy was updated in Q3." But which policy? Without context, the LLM cannot answer properly.
How parent-child chunking works
First, split the document into large parent sections (1000-1500 tokens).
Then split each parent into smaller child chunks (200-300 tokens).
During retrieval:
The system searches using the small child chunks (precise matching).
But it returns the entire parent section to the LLM (full context).
Why this works
You get the best of both worlds. Precise retrieval from small chunks, but the LLM sees the full context.
Parent Section (1500 tokens)
├ Child chunk 1 (300 tokens)
├ Child chunk 2 (300 tokens)
├ Child chunk 3 (300 tokens)
└ Child chunk 4 (300 tokens)
# Search using child chunks
# Return parent section to LLM
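As a minimal sketch of the idea (using naive character-based splitting for illustration, with `document_text` standing in for an already-normalized document):

```python
def split_fixed(text, size):
    """Naive fixed-size splitter, used here only for illustration."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_parent_child_index(text, parent_size=1500, child_size=300):
    """Return parent sections plus (child_text, parent_index) pairs."""
    parents = split_fixed(text, parent_size)
    children = []
    for p_idx, parent in enumerate(parents):
        for child in split_fixed(parent, child_size):
            children.append((child, p_idx))
    return parents, children

document_text = " ".join(f"Sentence number {i}." for i in range(200))  # toy document
parents, children = build_parent_child_index(document_text)
# Embed and search over the child texts; when child i matches,
# hand parents[children[i][1]] to the LLM for full context.
```

In production you would index the child texts in the vector store and keep the parent index (or a parent ID) in each child's metadata, so the lookup survives serialization.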
6. Layout-Aware Parsing
Many PDFs are not just text. They contain tables, images, charts, multi-column layouts, and text boxes.
If you extract these as plain text, the structure gets destroyed. A table becomes a mess of words with no clear meaning.
What layout-aware parsing does
It detects different elements in the document:
• Text blocks
• Tables
• Images
• Headers and footers
• Multi-column sections
Each element is extracted separately and processed differently. Tables stay as tables. Images are described. Text flows naturally.
Tools for layout parsing
Unstructured - Open source, works well for most PDFs
Azure Document Intelligence - Cloud service, very accurate
Amazon Textract - AWS service, good for forms and tables
PyMuPDF - Fast and lightweight, basic layout detection
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("report.pdf")
for e in elements:
    if e.category == "Table":
        ...  # handle the table separately
    elif e.category == "Image":
        ...  # extract and describe the image
    else:
        ...  # regular text goes through normal chunking
7. Handling Tables
Tables are one of the hardest parts of document ingestion. If you convert a table to plain text, it becomes unreadable.
The problem with flattening tables
Imagine this table:
| Year | Revenue | Profit |
|------|---------|--------|
| 2023 | 5M | 1.2M |
| 2024 | 7M | 2.1M |
If you flatten it to text, you get: "Year Revenue Profit 2023 5M 1.2M 2024 7M 2.1M"
The LLM cannot understand which number belongs to which column.
Better approach: Structured text
Convert tables into structured text that preserves relationships:
Year: 2023
Revenue: 5M
Profit: 1.2M
Year: 2024
Revenue: 7M
Profit: 2.1M
Now the LLM can clearly see that 5M revenue belongs to 2023.
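A small helper for this conversion (the function name and row format here are illustrative, not from any particular library) might look like:

```python
def table_to_structured_text(header, rows):
    """Render each table row as 'Column: value' lines, one record per row."""
    records = []
    for row in rows:
        records.append("\n".join(f"{col}: {val}" for col, val in zip(header, row)))
    return "\n\n".join(records)

chunk = table_to_structured_text(
    ["Year", "Revenue", "Profit"],
    [["2023", "5M", "1.2M"], ["2024", "7M", "2.1M"]],
)
```

Each record then becomes (or joins) a chunk, so a query like "2024 profit" retrieves text where the year and the number sit side by side.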
Tools for table extraction
LlamaParse - Specifically built for complex tables
Azure Document Intelligence - Excellent table detection
Amazon Textract - Good for forms and structured tables
Unstructured - Works for simple tables
PyMuPDF - Fast but less accurate
8. Handling Images
Images cannot be embedded directly with text embeddings. They must first be converted into text descriptions or vision embeddings.
Approach 1: Vision Language Models (Recommended)
Modern vision-language models like GPT-4o, Claude, Qwen or Gemini can generate detailed descriptions of images including charts, diagrams, and screenshots.
from openai import OpenAI
import base64

client = OpenAI()

with open("chart.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail for a RAG system."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
        ],
    }],
)
image_description = response.choices[0].message.content
Approach 2: OCR for Text-Heavy Images
For scanned documents or images with text, OCR extracts the text directly.
import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open("scan.png"))
Approach 3: Multimodal Embeddings
Some embedding models support both text and images in the same vector space. This allows direct image-to-text retrieval.
from sentence_transformers import SentenceTransformer
from PIL import Image

model = SentenceTransformer("clip-ViT-B-32")
# CLIP maps images and text into the same vector space,
# so these two embeddings can be compared directly
image_embedding = model.encode(Image.open("diagram.png"))
text_embedding = model.encode("technical architecture diagram")
9. Metadata Enriched Ingestion
Metadata is extra information about each chunk that helps during retrieval. It acts like filters or tags.
Why metadata matters
Imagine you have 1000 documents in your vector database. A user asks: "What was the revenue policy in 2024?"
Without metadata, the system searches through all 1000 documents. With metadata, it can filter to only documents from 2024 in the "finance" category.
This makes retrieval faster and more accurate.
Common metadata fields
source - Which document this chunk came from
date - When the document was created
section - Which section of the document
page_number - For reference
document_type - Policy, report, email, etc.
{
  "text": "The revenue policy was updated...",
  "metadata": {
    "source": "finance_policy_2024.pdf",
    "year": 2024,
    "section": "taxation",
    "page": 12,
    "document_type": "policy"
  }
}
During retrieval, you can filter: "Find chunks where year=2024 AND section='taxation'"
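Most vector databases expose this as a `where`-style filter on the query. The underlying idea, sketched in plain Python over the chunk format above (the helper and sample chunks are illustrative):

```python
chunks = [
    {"text": "The revenue policy was updated...",
     "metadata": {"source": "finance_policy_2024.pdf", "year": 2024,
                  "section": "taxation", "page": 12, "document_type": "policy"}},
    {"text": "Employee benefits were revised...",
     "metadata": {"source": "hr_policy_2023.pdf", "year": 2023,
                  "section": "benefits", "page": 4, "document_type": "policy"}},
]

def filter_chunks(chunks, **conditions):
    """Keep only chunks whose metadata matches every condition."""
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in conditions.items())]

results = filter_chunks(chunks, year=2024, section="taxation")
```

The vector search then runs only over the surviving chunks, which is both faster and less likely to surface a plausible-looking match from the wrong year.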
Final Thoughts
Many discussions about RAG focus on models and embeddings. But in practice, the biggest improvements often come from better document ingestion.
If document structure is broken during ingestion, even the best LLM cannot recover the lost context.
A well-designed ingestion pipeline usually leads to better retrieval and more reliable answers.