🖼️ ColPali Integration

Overview

This module provides a complete pipeline for processing PDF documents with ColPali embeddings, storing them in a Milvus vector database, and performing semantic search.

It is designed for document retrieval and retrieval-augmented generation (RAG) applications.

🧭 Architecture

The system consists of three main components:

PDF Processor - Extracts embeddings from PDF pages
Milvus Indexer - Stores and indexes embeddings
Retriever - Performs semantic search queries

📁 File Structure

src/mmore/colpali/
├── milvuscolpali.py      # Milvus database management
├── run_index.py          # Indexing pipeline
├── run_process.py        # PDF processing pipeline
├── run_retriever.py      # Search and retrieval API
└── retriever.py          # ColPaliRetriever class for RAG integration

🚀 Quick Start

1. Process PDFs into embeddings

python3 -m mmore colpali process --config-file examples/colpali/config_process.yml

Example config (config_process.yml):

data_path:
  - 'examples/sample_data/pdf'
output_path: "./output"
model_name: "vidore/colpali-v1.3"
skip_already_processed: true
num_workers: 5
batch_size: 8

2. Index embeddings into Milvus

python3 -m mmore colpali index --config-file examples/colpali/config_index.yml

Example config (config_index.yml):

parquet_path: ./output/pdf_page_objects.parquet
milvus:
    db_path: ./output/milvus_data.db
    collection_name: pdf_pages
    create_collection: true
    dim: 128
    metric_type: IP

3. Run Retrieval

Retrieval Server Mode

# Start the retrieval API server
python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval.yml

Or with a custom host and port:

python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval.yml --host 0.0.0.0 --port 8001

Example config (config_retrieval.yml):

db_path: "./output/milvus_data.db"
collection_name: "pdf_pages"
model_name: "vidore/colpali-v1.3"
top_k: 3
dim: 128
max_workers: 16
metric_type: "IP"
text_parquet_path: "./output/pdf_page_text.parquet"

Single Query Mode

# Run retrieval for a single query defined in the config file
python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval_single.yml

Example config (config_retrieval_single.yml):

mode: "single"
db_path: "./output/milvus_data.db"
collection_name: "pdf_pages"
model_name: "vidore/colpali-v1.3"
query: "What may lead to dysbiosis and inflammation?"
top_k: 5

The server's host and port are set with the CLI flags --host and --port, not in the config file.

Batch Mode

# Process queries from file
python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval.yml --input-file queries.jsonl --output-file results.json

Example queries file (queries.jsonl), with one JSON-encoded string per line:

"machine learning"
"neural networks"
"data processing"

Each line must be a valid JSON string, including quotes, since the file is parsed line by line with json.loads().
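A queries file in this format can be produced and checked with the standard json module. The following is a minimal sketch; the helper names write_queries and read_queries are hypothetical and are not part of mmore.

```python
import json
from typing import Iterable, List


def write_queries(path: str, queries: Iterable[str]) -> None:
    """Write one JSON-encoded string per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for query in queries:
            # json.dumps adds the surrounding quotes and escapes inner ones.
            f.write(json.dumps(query, ensure_ascii=False) + "\n")


def read_queries(path: str) -> List[str]:
    """Parse the file line by line with json.loads, skipping blank lines."""
    queries = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            value = json.loads(line)
            if not isinstance(value, str):
                raise ValueError(f"line {line_number}: expected a JSON string")
            queries.append(value)
    return queries
```

Using json.dumps rather than adding quotes by hand keeps queries that contain quotes or backslashes valid.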

Example config (config_retrieval.yml):

db_path: "./output/milvus_data.db"
collection_name: "pdf_pages"
model_name: "vidore/colpali-v1.3"
top_k: 5
dim: 128
max_workers: 16
text_parquet_path: "./output/pdf_page_text.parquet"

🔧 Core Components

MilvusColpaliManager

manages local Milvus database operations
handles collection creation and indexing
provides efficient batch insertion
implements hybrid search with reranking

Key Features:

local Milvus instance with no external dependencies
automatic collection management
multi-vector support for pages
efficient batch operations
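Multi-vector support matters because ColPali represents a page as a set of vectors (one per image patch) rather than a single vector. The ColPali model scores a query against a page with late-interaction MaxSim: for each query vector, take the best-matching page vector and sum those maxima. The sketch below illustrates that scoring in plain Python. Whether MilvusColpaliManager reranks in exactly this way is an assumption, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product, matching the IP metric type used in the configs."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vectors: Sequence[Vector], page_vectors: Sequence[Vector]) -> float:
    """Sum, over query vectors, of the best inner product with any page vector."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


def rank_pages(
    query_vectors: Sequence[Vector],
    pages: Dict[str, Sequence[Vector]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every candidate page and return the top_k as (page_id, score)."""
    scored = [(page_id, maxsim_score(query_vectors, vectors)) for page_id, vectors in pages.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

In practice a vector index first narrows the candidates, and a score like this is computed only for the pages that survive that first pass.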

PDF Processor

converts PDF pages to images
generates ColPali embeddings
handles parallel processing
supports stop-and-resume workflows for large datasets

Processing Flow:

Crawl PDF files from specified directories
Convert each page to high-resolution PNG
Generate embeddings using ColPali model
Store results in Parquet format
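The crawl step and the stop-and-resume behaviour can be sketched with the standard library alone. This is an illustration under assumptions, not the module's implementation: the function names are hypothetical, and already_processed stands in for whatever record of finished files the processor keeps.

```python
from pathlib import Path
from typing import Iterable, Iterator, List, Set


def crawl_pdfs(data_paths: Iterable[str]) -> Iterator[Path]:
    """Yield every *.pdf file under each directory, in a stable sorted order."""
    for root in data_paths:
        yield from sorted(Path(root).rglob("*.pdf"))


def pending_pdfs(data_paths: Iterable[str], already_processed: Set[str]) -> List[Path]:
    """Return the PDFs still to process, skipping paths recorded as done.

    already_processed holds resolved absolute paths (an assumption made here
    so that the same file is recognised however it was reached).
    """
    return [
        pdf for pdf in crawl_pdfs(data_paths)
        if str(pdf.resolve()) not in already_processed
    ]
```

Page rendering and embedding are left out because they depend on the PDF and model libraries in use.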

Retriever

supports multiple usage modes: server mode by default, single-query mode via config, or batch mode with --input-file and --output-file
performs fast semantic search with reranking
exposes a REST API for integration
supports configurable top-k results
provides a LangChain-compatible BaseRetriever for RAG integration
can retrieve page text through the text_parquet_path configuration

🎯 Use Cases

Document Retrieval

# Example API call
curl -X POST "http://localhost:8001/v1/retrieve" \
     -H "Content-Type: application/json" \
     -d '{"query": "machine learning", "top_k": 3}'

Response format:

{
  "query": "machine learning",
  "results": [
    {
      "pdf_name": "ml_book.pdf",
      "pdf_path": "/path/to/ml_book.pdf",
      "page_number": 42,
      "content": "Machine learning is a subset of artificial intelligence...",
      "similarity": 0.894,
      "rank": 1
    }
  ]
}
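The same request can be sent from Python with the standard library. The sketch below assumes the server is reachable at http://localhost:8001, as in the curl example; the function names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict


def build_retrieve_request(
    query: str, top_k: int = 3, base_url: str = "http://localhost:8001"
) -> urllib.request.Request:
    """Build the POST request for the /v1/retrieve endpoint."""
    payload = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/retrieve",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def retrieve(
    query: str,
    top_k: int = 3,
    base_url: str = "http://localhost:8001",
    timeout: float = 30.0,
) -> Dict[str, Any]:
    """Send the request and decode the JSON response body."""
    request = build_retrieve_request(query, top_k, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, retrieve("machine learning")["results"] should hold entries shaped like the response format above.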

RAG Pipeline Integration

from mmore.colpali.retriever import ColPaliRetriever, ColPaliRetrieverConfig
from mmore.rag.pipeline import RAGPipeline, RAGConfig

# Create ColPali retriever with text support
colpali_config = ColPaliRetrieverConfig(
    db_path="./output/milvus_data.db",
    collection_name="pdf_pages",
    model_name="vidore/colpali-v1.3",
    text_parquet_path="./output/pdf_page_text.parquet",
    top_k=3,
    dim=128,
    max_workers=16,
    metric_type="IP",
)
colpali_retriever = ColPaliRetriever.from_config(colpali_config)

# Use with RAG pipeline (requires LLM config)
# rag_config = RAGConfig(retriever=colpali_retriever, ...)
# rag_pipeline = RAGPipeline.from_config(rag_config)

The ColPaliRetriever is a LangChain-compatible BaseRetriever that returns Document objects with:

page_content: the text content from the PDF page, if text_parquet_path is provided
metadata: contains pdf_name, pdf_path, page_number, rank, and similarity score
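One way to use these fields is to build a cited context string for an LLM prompt. The sketch below is illustrative only: PageDocument is a hypothetical stand-in with the same two attributes as a LangChain Document, so the example runs without LangChain installed, and format_context is not part of mmore.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable


@dataclass
class PageDocument:
    """Hypothetical stand-in exposing page_content and metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def format_context(documents: Iterable[PageDocument]) -> str:
    """Join retrieved pages, best rank first, each labelled with its source."""
    blocks = []
    for doc in sorted(documents, key=lambda d: d.metadata["rank"]):
        source = f'{doc.metadata["pdf_name"]}, page {doc.metadata["page_number"]}'
        # page_content is empty when no text_parquet_path was configured.
        text = doc.page_content or "(no text available)"
        blocks.append(f"[{source}]\n{text}")
    return "\n\n".join(blocks)
```

Keeping the PDF name and page number next to each passage lets the generated answer point back to its source page.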

📦 Output Formats

Process Output

Embeddings Parquet (pdf_page_objects.parquet)

{
  "pdf_path": "/path/to/doc1.pdf",
  "page_number": 1,
  "embedding": [0.1, 0.2, "..."]
}

Text Mapping Parquet (pdf_page_text.parquet)

{
  "pdf_path": "/path/to/doc1.pdf",
  "page_number": 1,
  "text": "Page content text here..."
}

Search Results

API Response:

{
  "query": "machine learning",
  "results": [
    {
      "pdf_name": "ml_book.pdf",
      "pdf_path": "/path/to/ml_book.pdf",
      "page_number": 42,
      "content": "Machine learning is a subset of artificial intelligence...",
      "similarity": 0.894,
      "rank": 1
    }
  ]
}

Batch Mode Output:

{
  "query": "machine learning",
  "context": [
    {
      "page_content": "Machine learning is a subset of artificial intelligence...",
      "metadata": {
        "pdf_name": "ml_book.pdf",
        "pdf_path": "/path/to/ml_book.pdf",
        "page_number": 42,
        "rank": 1,
        "similarity": 0.894
      }
    }
  ]
}

🔁 Pipeline Example

Complete Workflow

# 1. Process all PDFs in a directory
python3 -m mmore colpali process --config-file examples/colpali/config_process.yml

# 2. Index the embeddings
python3 -m mmore colpali index --config-file examples/colpali/config_index.yml

# 3. Start the API server
python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval.yml

# 4. Query the system
curl -X POST "http://localhost:8001/v1/retrieve" \
     -H "Content-Type: application/json" \
     -d '{"query": "your search query", "top_k": 3}'

Alternative: Batch processing

# 1. Process PDFs (same as above)
python3 -m mmore colpali process --config-file examples/colpali/config_process.yml

# 2. Index embeddings (same as above)
python3 -m mmore colpali index --config-file examples/colpali/config_index.yml

# 3. Run batch retrieval
python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval.yml \
                       --input-file queries.jsonl \
                       --output-file results.json

💡 Configuration tips

For large datasets

increase batch_size and num_workers in process config
use skip_already_processed: true for incremental processing

For better accuracy

use a higher DPI for PDF-to-image conversion (the default is 200)
increase top_k in retrieval to inspect more candidate pages
consider using larger ColPali models if available

For production

run Milvus in distributed mode for larger datasets
use server mode (the REST API) for scalable serving
implement caching for frequent queries
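Caching for frequent queries can be as simple as wrapping the retrieval call in functools.lru_cache. The sketch below assumes a retrieval function that takes a query and top_k and returns a list of results; make_cached_retriever and its parameters are hypothetical names.

```python
from functools import lru_cache
from typing import Any, Callable, List, Tuple


def make_cached_retriever(
    retrieve_fn: Callable[[str, int], List[Any]], maxsize: int = 256
) -> Callable[..., Tuple[Any, ...]]:
    """Wrap retrieve_fn so repeated (query, top_k) pairs skip the search."""

    @lru_cache(maxsize=maxsize)
    def _cached(query: str, top_k: int) -> Tuple[Any, ...]:
        # Return a tuple so callers cannot reorder the cached result list.
        return tuple(retrieve_fn(query, top_k))

    def cached_retrieve(query: str, top_k: int = 3) -> Tuple[Any, ...]:
        # Collapse whitespace so trivially different spellings share an entry.
        return _cached(" ".join(query.split()), top_k)

    return cached_retrieve
```

The cache lives in process memory and has no expiry, so it should be rebuilt (or the process restarted) after the collection is re-indexed.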