🖼️ ColPali Integration¶
Overview¶
This module provides a complete pipeline for processing PDF documents with ColPali embeddings, storing them in a Milvus vector database, and performing semantic search.
It is designed for efficient document retrieval and RAG applications.
🧭 Architecture¶
The system consists of three main components:
PDF Processor - Extracts embeddings from PDF pages
Milvus Indexer - Stores and indexes embeddings
Retriever - Performs semantic search queries
📁 File Structure¶
src/mmore/colpali/
├── milvuscolpali.py # Milvus database management
├── run_index.py # Indexing pipeline
├── run_process.py # PDF processing pipeline
├── run_retriever.py # Search and retrieval API
└── retriever.py # ColPaliRetriever class for RAG integration
🚀 Quick Start¶
1. Process PDFs into embeddings¶
python3 -m mmore colpali process --config-file examples/colpali/config_process.yml
Example config (config_process.yml):
data_path:
- 'examples/sample_data/pdf'
output_path: "./output"
model_name: "vidore/colpali-v1.3"
skip_already_processed: true
num_workers: 5
batch_size: 8
2. Index embeddings into Milvus¶
python3 -m mmore colpali index --config-file examples/colpali/config_index.yml
Example config (config_index.yml):
parquet_path: ./output/pdf_page_objects.parquet
milvus:
db_path: ./output/milvus_data.db
collection_name: pdf_pages
create_collection: true
dim: 128
metric_type: IP
3. Run Retrieval¶
Retrieval Server Mode¶
# Start the retrieval API server
python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval.yml
Or with a custom host and port:
python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval.yml --host 0.0.0.0 --port 8001
Example config (config_retrieval.yml):
db_path: "./milvus_data"
collection_name: "pdf_pages"
model_name: "vidore/colpali-v1.3"
top_k: 3
dim: 128
max_workers: 16
metric_type: "IP"
text_parquet_path: "./output/pdf_page_text.parquet"
Single Query Mode¶
# Run retrieval for a single query defined in the config file
python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval_single.yml
Example config (config_retrieval_single.yml):
mode: "single"
db_path: "./milvus_data"
collection_name: "pdf_pages"
model_name: "vidore/colpali-v1.3"
query: "What may lead to dysbiosis and inflammation?"
top_k: 5
Host and port are specified via CLI flags (--host and --port), not in the config file.
Batch Mode¶
# Process queries from file
python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval.yml --input-file queries.jsonl --output-file results.json
Example queries file (queries.jsonl):
Each line should be a JSON-encoded string (one query per line):
"machine learning"
"neural networks"
"data processing"
Each line must be a valid JSON string, including quotes, since the file is parsed line by line with json.loads().
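The file format above can be produced and checked with a few lines of standard-library Python. This is a minimal sketch, not part of the module: `write_queries` and `read_queries` are hypothetical helper names, and the reader only mirrors the line-by-line `json.loads()` parsing described above.

```python
import json


def write_queries(path, queries):
    """Write one JSON-encoded string per line (quotes and escapes included)."""
    with open(path, "w", encoding="utf-8") as f:
        for query in queries:
            f.write(json.dumps(query) + "\n")


def read_queries(path):
    """Parse each non-empty line with json.loads and require a string."""
    queries = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in this sketch
            value = json.loads(line)
            if not isinstance(value, str):
                raise ValueError(
                    f"line {lineno}: expected a JSON string, got {type(value).__name__}"
                )
            queries.append(value)
    return queries
```

Using `json.dumps` to write the file avoids hand-escaping quotes or backslashes inside a query.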
Example config (config_retrieval.yml):
db_path: "./milvus_data"
collection_name: "pdf_pages"
model_name: "vidore/colpali-v1.3"
top_k: 5
dim: 128
max_workers: 16
text_parquet_path: "./output/pdf_page_text.parquet"
🔧 Core Components¶
MilvusColpaliManager¶
manages local Milvus database operations
handles collection creation and indexing
provides efficient batch insertion
implements hybrid search with reranking
Key Features:
local Milvus instance with no external dependencies
automatic collection management
multi-vector support for pages
efficient batch operations
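ColPali represents each page as a set of vectors rather than a single vector, and pages are usually ranked with a late-interaction (MaxSim) score. The sketch below illustrates that scoring in plain Python under the assumption that reranking follows this scheme; how `MilvusColpaliManager` computes it internally is not shown in this document, and `maxsim_score` and `rerank` are hypothetical names.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, page_vectors):
    """For each query vector, take its best inner product against the
    page vectors, then sum those maxima over all query vectors."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


def rerank(query_vectors, candidates, top_k=3):
    """Rank candidate pages by MaxSim.

    candidates: dict mapping a page id to that page's list of vectors.
    Returns up to top_k (page_id, score) pairs, best first.
    """
    scored = [
        (maxsim_score(query_vectors, vectors), page_id)
        for page_id, vectors in candidates.items()
    ]
    scored.sort(reverse=True)
    return [(page_id, score) for score, page_id in scored[:top_k]]
```

A real implementation would use batched tensor operations instead of Python loops; the loops here only make the scoring rule explicit.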
PDF Processor¶
converts PDF pages to images
generates ColPali embeddings
handles parallel processing
supports stop-and-resume workflows for large datasets
Processing Flow:
Crawl PDF files from specified directories
Convert each page to high-resolution PNG
Generate embeddings using ColPali model
Store results in Parquet format
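The first step of the flow, combined with the stop-and-resume behaviour, can be sketched as follows. This is an illustration only: `find_pdfs` is a hypothetical helper, and the way the actual processor records which files are already done is not described here.

```python
from pathlib import Path


def find_pdfs(data_paths, already_processed=()):
    """Recursively collect PDF files under each directory in data_paths,
    skipping any file listed in already_processed."""
    done = {str(Path(p).resolve()) for p in already_processed}
    found = []
    for root in data_paths:
        # Note: on case-sensitive file systems "*.pdf" does not match ".PDF".
        for pdf in sorted(Path(root).rglob("*.pdf")):
            if str(pdf.resolve()) not in done:
                found.append(pdf)
    return found
```

Passing the paths already present in an earlier output as `already_processed` gives the incremental behaviour that `skip_already_processed: true` is meant to provide.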
Retriever¶
supports multiple usage modes: server mode by default, single-query mode via config, or batch mode with --input-file and --output-file
performs fast semantic search with reranking
exposes a REST API for integration
supports configurable top-k results
provides a LangChain-compatible BaseRetriever for RAG integration
can retrieve page text through the text_parquet_path configuration
🎯 Use Cases¶
Document Retrieval¶
# Example API call
curl -X POST "http://localhost:8001/v1/retrieve" \
-H "Content-Type: application/json" \
-d '{"query": "machine learning", "top_k": 3}'
Response format:
{
"query": "machine learning",
"results": [
{
"pdf_name": "ml_book.pdf",
"pdf_path": "/path/to/ml_book.pdf",
"page_number": 42,
"content": "Machine learning is a subset of artificial intelligence...",
"similarity": 0.894,
"rank": 1
}
]
}
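The same call can be made from Python with only the standard library. This is a minimal sketch that assumes the endpoint and the response fields shown above; `build_request`, `parse_response`, and `retrieve` are hypothetical helper names, and `retrieve` needs a running server.

```python
import json
import urllib.request


def build_request(query, top_k=3, base_url="http://localhost:8001"):
    """Build the POST request for /v1/retrieve with a JSON body."""
    payload = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/retrieve",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_response(body):
    """Reduce the JSON response to (rank, pdf_name, page_number, similarity) tuples."""
    data = json.loads(body)
    return [
        (r["rank"], r["pdf_name"], r["page_number"], r["similarity"])
        for r in data["results"]
    ]


def retrieve(query, top_k=3, base_url="http://localhost:8001"):
    """Send the query to a running retrieval server and parse the reply."""
    with urllib.request.urlopen(build_request(query, top_k, base_url)) as resp:
        return parse_response(resp.read().decode("utf-8"))
```

Keeping request building and response parsing separate from the network call makes both easy to check without a server.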
RAG Pipeline Integration¶
from mmore.colpali.retriever import ColPaliRetriever, ColPaliRetrieverConfig
from mmore.rag.pipeline import RAGPipeline, RAGConfig
# Create ColPali retriever with text support
colpali_config = ColPaliRetrieverConfig(
db_path="./output/milvus_data.db",
collection_name="pdf_pages",
model_name="vidore/colpali-v1.3",
text_parquet_path="./output/pdf_page_text.parquet",
top_k=3,
dim=128,
max_workers=16,
metric_type="IP",
)
colpali_retriever = ColPaliRetriever.from_config(colpali_config)
# Use with RAG pipeline (requires LLM config)
# rag_config = RAGConfig(retriever=colpali_retriever, ...)
# rag_pipeline = RAGPipeline.from_config(rag_config)
The ColPaliRetriever is a LangChain-compatible BaseRetriever that returns Document objects with:
page_content: the text content from the PDF page, if text_parquet_path is provided
metadata: contains pdf_name, pdf_path, page_number, rank, and similarity score
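Before the retrieved pages reach an LLM, they are typically flattened into one context string. The helper below is a hypothetical sketch that relies only on the `page_content` and `metadata` fields listed above; the placeholder used for pages without text is an assumption.

```python
def format_context(documents):
    """Join retrieved pages into one prompt-ready string with source labels."""
    blocks = []
    for doc in documents:
        meta = doc.metadata
        label = f"[{meta['rank']}] {meta['pdf_name']}, page {meta['page_number']}"
        # page_content may be empty when no text parquet was configured
        text = doc.page_content or "(no text available for this page)"
        blocks.append(f"{label}\n{text}")
    return "\n\n".join(blocks)
```

Since the retriever is a LangChain `BaseRetriever`, the documents would normally come from its standard `invoke` method, for example `format_context(colpali_retriever.invoke("machine learning"))`.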
📦 Output Formats¶
Process Output¶
Embeddings Parquet (pdf_page_objects.parquet)
{
"pdf_path": "/path/to/doc1.pdf",
"page_number": 1,
"embedding": [0.1, 0.2, "..."]
}
Text Mapping Parquet (pdf_page_text.parquet)
{
"pdf_path": "/path/to/doc1.pdf",
"page_number": 1,
"text": "Page content text here..."
}
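The two files share the `pdf_path` and `page_number` fields, so page text can be matched to an embedding row by that pair. The following sketch works on rows already loaded as dictionaries; `attach_text` is a hypothetical helper, not a function of the module.

```python
def attach_text(embedding_rows, text_rows):
    """Add a "text" field to each embedding row by matching on
    (pdf_path, page_number). Rows without a match get text=None."""
    text_by_page = {
        (row["pdf_path"], row["page_number"]): row["text"] for row in text_rows
    }
    return [
        {**row, "text": text_by_page.get((row["pdf_path"], row["page_number"]))}
        for row in embedding_rows
    ]
```

With pandas and a Parquet engine installed, the rows could be loaded with `pd.read_parquet(path).to_dict("records")` before calling the helper.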
Search Results¶
API Response:
{
"query": "machine learning",
"results": [
{
"pdf_name": "ml_book.pdf",
"pdf_path": "/path/to/ml_book.pdf",
"page_number": 42,
"content": "Machine learning is a subset of artificial intelligence...",
"similarity": 0.894,
"rank": 1
}
]
}
Batch Mode Output:
{
"query": "machine learning",
"context": [
{
"page_content": "Machine learning is a subset of artificial intelligence...",
"metadata": {
"pdf_name": "ml_book.pdf",
"pdf_path": "/path/to/ml_book.pdf",
"page_number": 42,
"rank": 1,
"similarity": 0.894
}
}
]
}
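The batch output carries the same information as the API response in a different layout: `page_content` corresponds to `content`, and the remaining fields sit under `metadata`. A small converter makes the mapping explicit. It is a sketch based only on the two shapes shown in this section, and `batch_record_to_api` is a hypothetical name.

```python
def batch_record_to_api(record):
    """Convert one batch-mode record into the API response layout."""
    results = []
    for item in record["context"]:
        meta = item["metadata"]
        results.append(
            {
                "pdf_name": meta["pdf_name"],
                "pdf_path": meta["pdf_path"],
                "page_number": meta["page_number"],
                "content": item["page_content"],
                "similarity": meta["similarity"],
                "rank": meta["rank"],
            }
        )
    return {"query": record["query"], "results": results}
```

This lets downstream code consume server responses and batch results through a single code path.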
🔁 Pipeline Example¶
Complete Workflow¶
# 1. Process all PDFs in a directory
python3 -m mmore colpali process --config-file examples/colpali/config_process.yml
# 2. Index the embeddings
python3 -m mmore colpali index --config-file examples/colpali/config_index.yml
# 3. Start the API server
python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval.yml
# 4. Query the system
curl -X POST "http://localhost:8001/v1/retrieve" \
-H "Content-Type: application/json" \
-d '{"query": "your search query", "top_k": 3}'
Alternative: Batch processing¶
# 1. Process PDFs (same as above)
python3 -m mmore colpali process --config-file examples/colpali/config_process.yml
# 2. Index embeddings (same as above)
python3 -m mmore colpali index --config-file examples/colpali/config_index.yml
# 3. Run batch retrieval
python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval.yml \
--input-file queries.jsonl \
--output-file results.json
💡 Configuration tips¶
For large datasets¶
increase batch_size and num_workers in process config
use skip_already_processed: true for incremental processing
For better accuracy¶
use a higher DPI for PDF conversion (the default is 200)
increase top_k in retrieval to inspect more candidate pages
consider using larger ColPali models if available
For production¶
run Milvus in distributed mode for larger datasets
use the API mode for scalable serving
implement caching for frequent queries
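The caching tip can be sketched with `functools.lru_cache` from the standard library. This is one possible in-process approach, not something the module provides: `make_cached_retriever` is a hypothetical wrapper, and `retrieve_fn` stands for any function that takes a query and a top-k value and returns a list of results.

```python
from functools import lru_cache


def make_cached_retriever(retrieve_fn, maxsize=256):
    """Wrap retrieve_fn(query, top_k) with an in-memory LRU cache."""

    @lru_cache(maxsize=maxsize)
    def cached(query, top_k=3):
        # Return a tuple so callers cannot mutate the cached result list.
        return tuple(retrieve_fn(query, top_k))

    return cached
```

The cache is keyed on the exact call arguments and lives in a single process, so it should be cleared with `cached.cache_clear()` whenever the collection is re-indexed.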