Process
Overview
The process module enables the extraction and standardization of text and images from diverse file formats (listed below), making it ideal for creating datasets for applications such as RAG, multimodal content generation, and preprocessing data for LLMs and multimodal LLMs.
Quick Start
Global installation
Set up the project on each device you want to use by following Installation.
Running locally
To run the process locally, first specify the input folders in the config file examples/process/config.yaml. You can also adjust the parameters to your needs.
Once ready, run:
python3 -m mmore process --config-file examples/process/config.yaml
Google Drive support
MMORE also supports processing documents directly from Google Drive.
To enable this feature, create a Google service account and download its credentials as a JSON file. Name that file client_secrets.json and place it in googledrive/ (create this folder at the root of the mmore repository if it does not exist).
Make sure your Google service account has permission to view the drives you want to process.
Referencing Google Drive sources
Google Drive folders are referenced in the process config file through the google_drive_ids field.
For example:
data_path: examples/sample_data/ # Use an absolute path; a list of folders is also accepted
google_drive_ids: [] # IDs of Google Drive folders to process
data_path is used for local input folders
google_drive_ids is used to provide one or more Google Drive folder IDs to process
To process documents from Google Drive, add the folder IDs to the list:
data_path: examples/sample_data/
google_drive_ids:
- your_google_drive_folder_id
- another_google_drive_folder_id
You can use local folders, Google Drive folders, or both in the same configuration.
Make sure each referenced Google Drive folder is shared with the service account used by MMORE.
You can find an example config file in examples/process/config.yaml.
Output structure
The output of the pipeline has the following structure:
output_path
├── processors
│   ├── Processor_type_1
│   │   └── results.jsonl
│   ├── Processor_type_2
│   │   └── results.jsonl
│   └── ...
├── merged
│   └── merged_results.jsonl
└── images
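Each results.jsonl and merged_results.jsonl file contains one JSON object per line. As an illustration (not part of mmore itself), a sketch like the following can load such a file; the text field used here is only an assumed example of what a sample may contain:

```python
import json
from pathlib import Path

def load_samples(jsonl_path):
    """Read one JSON object per line from a JSONL results file."""
    samples = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                samples.append(json.loads(line))
    return samples

# Demo with a small synthetic file (real files live under output_path):
demo = Path("merged_results_demo.jsonl")
demo.write_text('{"text": "hello"}\n{"text": "world"}\n', encoding="utf-8")
samples = load_samples(demo)
```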
Running on distributed nodes
A simple bash script is provided to run the process in distributed mode.
bash scripts/process_distributed.sh -f /path/to/my/input/folder
See also Distributed processing.
Examples
You can find additional example scripts in the /examples directory.
Optimization
Fast mode
For some file types, we provide a fast mode that processes files faster using a different method. To use it, set use_fast_processors to true in the config file.
Be aware that the fast mode might not be as accurate as the default mode, especially for scanned non-native PDFs, which may require Optical Character Recognition (OCR) for more accurate extraction.
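In the config file, enabling fast mode is a single flag. The snippet below is a minimal excerpt, assuming the rest of examples/process/config.yaml stays unchanged:

```yaml
# examples/process/config.yaml (excerpt)
use_fast_processors: true  # faster extraction, possibly less accurate on scanned PDFs
```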
File type parameter tuning
Many parameters are hardware-dependent and can be customized to suit your needs.
For example, you can tune:
processor batch size
dispatcher batch size
number of threads per worker
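A sketch of what such tuning could look like in YAML; the key names below are hypothetical, so check examples/process/config.yaml for the exact fields your version supports:

```yaml
# Hypothetical keys for illustration only -- consult
# examples/process/config.yaml for the actual parameter names.
dispatcher:
  batch_size: 16      # dispatcher batch size
processors:
  batch_size: 8       # processor batch size
  num_threads: 4      # number of threads per worker
```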
You can configure parameters by providing a custom config file. You can find an example of a config file in examples/process/config.yaml.
Note that not all parameters are configurable yet.
For distributed execution options, see the Quick Start and Distributed processing.
Incremental reprocessing
The optional top-level previous_results parameter lets you reuse results from a prior run and avoid reprocessing unchanged files, saving time and compute.
previous_results: examples/process/outputs/merged/merged_results.jsonl
Point it to a merged_results.jsonl produced by an earlier run. On the next run, each local input file is compared against that JSONL (URL inputs are always reprocessed):
Unchanged files: their previous samples are reused as-is.
New or modified files: they are processed normally.
Removed files: their samples are dropped from the output.
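The decision logic above can be sketched in a few lines of Python. This is an illustrative reimplementation, not mmore's actual code; file_digest and plan_run are hypothetical names, and the real pipeline may use a different change-detection scheme:

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Content hash used to detect modified files (illustrative)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def plan_run(input_files, previous):
    """Split inputs into reused vs. to-process, mirroring the rules above.

    `previous` maps a file path to the digest recorded in an earlier run.
    Removed files never appear in either output list, so their samples
    are effectively dropped.
    """
    reused, to_process = [], []
    for f in input_files:
        if previous.get(str(f)) == file_digest(f):
            reused.append(f)        # unchanged: reuse prior samples
        else:
            to_process.append(f)    # new or modified: process normally
    return reused, to_process
```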
More information on what's under the hood
Pipeline architecture
Our pipeline is a three-step process:
1. Crawling: Files and folders are scanned to identify the files to process, skipping those already processed.
2. Dispatching: Files are dispatched to workers in batches. In distributed setups, this stage is also responsible for load balancing across nodes.
3. Processing: Workers process files with the appropriate tools for each file type. They extract text, images, audio, and video frames, then pass the results to the next stage.
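The three stages can be sketched as plain functions. This is a simplified illustration, not mmore's implementation; names like crawl and dispatch are hypothetical:

```python
from itertools import islice
from pathlib import Path

def crawl(root, already_processed=()):
    """Stage 1: find files to process, skipping ones seen before."""
    seen = set(already_processed)
    return [p for p in Path(root).rglob("*") if p.is_file() and str(p) not in seen]

def dispatch(files, batch_size):
    """Stage 2: yield batches of files for the workers."""
    it = iter(files)
    while batch := list(islice(it, batch_size)):
        yield batch

def process(batch, processor):
    """Stage 3: apply the file-type-specific processor to each file."""
    return [processor(f) for f in batch]
```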
MMORE uses a common data structure for document samples: MultimodalSample.
The goal is to make it easy to add new processors for new file types, or alternative processing methods for existing ones.
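As a rough sketch of what such a common structure looks like, the dataclass below is illustrative; the actual MultimodalSample class in mmore may have different fields:

```python
from dataclasses import dataclass, field

@dataclass
class MultimodalSample:
    """Illustrative sketch only; mmore's real class may differ."""
    text: str = ""                                   # extracted text
    modalities: list = field(default_factory=list)   # e.g. image paths
    metadata: dict = field(default_factory=dict)     # e.g. source file info

sample = MultimodalSample(text="Report body <attachment>",
                          modalities=["images/fig1.png"],
                          metadata={"file_path": "report.docx"})
```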
Supported file types and tools
The project supports multiple file types and utilizes various AI-based tools for processing. Below is a table summarizing the supported file types and corresponding tools (N/A means no alternative tool is available):
| File Type | Default Mode Tool(s) | Fast Mode Tool(s) |
|---|---|---|
| DOCX | python-docx to extract the text and images | N/A |
| MD | markdown for text extraction, markdownify for HTML conversion | N/A |
| PPTX | python-pptx to extract the text and images | N/A |
| XLSX | openpyxl to extract the text and images | N/A |
| TXT | N/A | |
| EML | N/A | |
| MP4, MOV, AVI, MKV, MP3, WAV, AAC | moviepy for video frame extraction; whisper-large-v3-turbo for transcription | |
| PDF | marker-pdf for OCR and structured data extraction | PyMuPDF for text and image extraction |
| HTML | markdownify to convert HTML to MD; requests for images | N/A |
MMORE also uses Dask Distributed to manage distributed execution.
Customization
The system is designed to be extensible, allowing you to register custom processors for handling new file types or specialized processing. To implement a new processor you need to inherit the Processor class and implement only two methods:
accepts: defines which file types your processor supports (e.g. docx)
process: how to process a single file (input: a file, output: a MultimodalSample; see other processors for reference)
For a minimal example, see TextProcessor.
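A minimal sketch of a custom processor, assuming a simplified stand-in for the Processor base class; the real mmore class and the exact signatures of accepts and process may differ:

```python
from pathlib import Path

class Processor:
    """Simplified stand-in for mmore's Processor base class."""
    @classmethod
    def accepts(cls, file_path) -> bool:
        raise NotImplementedError
    def process(self, file_path):
        raise NotImplementedError

class LogProcessor(Processor):
    """Hypothetical processor for plain-text .log files."""
    @classmethod
    def accepts(cls, file_path) -> bool:
        return Path(file_path).suffix == ".log"

    def process(self, file_path):
        text = Path(file_path).read_text(encoding="utf-8")
        # Return a dict standing in for a MultimodalSample.
        return {"text": text, "modalities": [],
                "metadata": {"file_path": str(file_path)}}
```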
Post-processing
Post-processing refines the extracted text data to improve quality for downstream tasks. The infrastructure is modular and extensible, and mmore ships with a set of built-in post-processors.
Applying the Chunker is strongly recommended, as it cuts documents into reasonably sized chunks that are better suited for feeding to an LLM.
The chunker supports a table_handling option to control how markdown tables are split:
| Mode | Description |
|---|---|
| | Each table row has its own chunk, with the header repeated for context |
| | Rows are grouped to fill the chunk size, with the header repeated per chunk |
| | Tables are never split and are kept as one chunk regardless of size |
| | No special table handling; tables are chunked like regular text |
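As an illustration of the row-wise mode, a sketch like this splits a markdown table into one chunk per data row with the header repeated; this is not mmore's actual chunker code:

```python
def split_table_by_row(table_md: str):
    """One chunk per data row, repeating the header row and separator
    for context (illustrative only)."""
    lines = table_md.strip().splitlines()
    header, separator, rows = lines[0], lines[1], lines[2:]
    return [f"{header}\n{separator}\n{row}" for row in rows]

table = "| a | b |\n|---|---|\n| 1 | 2 |\n| 3 | 4 |"
chunks = split_table_by_row(table)
```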
You can configure parameters by providing a custom config file. This field is shown in the example config file at examples/process/config.yaml.
Once ready, you can run the process using the following command:
python3 -m mmore postprocess --config-file examples/postprocessor/config.yaml --input-data examples/process/outputs/merged/merged_results.jsonl
Use --input-data to specify the path (absolute or relative to the root of the repository) to the JSONL output of the initial processing phase.
Incremental post-processing
Like the processing pipeline, the post-processor accepts an optional previous_results parameter to reuse results from a prior post-processing run and skip unchanged documents.
previous_results: examples/postprocessor/outputs/merged/results.jsonl
New post-processors can easily be implemented, and pipelines can be configured through lightweight YAML files. The post-processing stage produces a new JSONL file containing cleaned and optionally enhanced document samples.