βš™οΈ ProcessΒΆ

OverviewΒΆ

The process module extracts and standardizes text and images from a wide range of file formats (listed below), making it well suited to building datasets for applications such as RAG and multimodal content generation, and to preprocessing data for LLMs and multimodal LLMs.

πŸ”¨Quick StartΒΆ

πŸ‘©β€πŸ’» Global installationΒΆ

Set up the project on each device you want to use by following Installation.

πŸ’» Running locallyΒΆ

To run the process locally, first specify the input folders in the config file examples/process/config.yaml. You can also adjust the parameters to your needs.
Once ready, run:

python3 -m mmore process --config-file examples/process/config.yaml
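The config file referenced above might look like the following sketch. Only data_path, google_drive_ids, and use_fast_processors are keys documented on this page; any other field names are assumptions, so treat examples/process/config.yaml as the authoritative reference.

```yaml
# Sketch of a minimal process config — see examples/process/config.yaml
# for the full, authoritative set of fields.
data_path: examples/sample_data/   # local input folder(s); absolute path recommended
google_drive_ids: []               # optional Google Drive folder IDs
use_fast_processors: false         # trade some accuracy for speed (see Optimization)
```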

πŸ“Œ Google Drive supportΒΆ

MMORE also supports processing documents directly from Google Drive.

To enable this feature, create a Google service account and download its secrets as a JSON file. Name the file client_secrets.json and place it in googledrive/ (create this folder at the root of the mmore repository if it does not exist).

Make sure your Google service account has permission to view the drives you want to process.

Referencing Google Drive sourcesΒΆ

Google Drive folders are referenced in the process config file through the google_drive_ids field.

For example:

data_path: examples/sample_data/ # Absolute path recommended; a list of folders is also accepted
google_drive_ids: [] # IDs of Google Drive folders to process
  β€’ data_path specifies one or more local input folders

  β€’ google_drive_ids provides one or more Google Drive folder IDs to process

To process documents from Google Drive, add the folder IDs to the list:

data_path: examples/sample_data/
google_drive_ids:
  - your_google_drive_folder_id
  - another_google_drive_folder_id

You can use local folders, Google Drive folders, or both in the same configuration.

Make sure each referenced Google Drive folder is shared with the service account used by MMORE.

You can find an example config file in examples/process/config.yaml.

πŸ“‚ Output structureΒΆ

The output of the pipeline has the following structure:

output_path
β”œβ”€β”€ processors
β”‚   β”œβ”€β”€ Processor_type_1
β”‚   β”‚   └── results.jsonl
β”‚   β”œβ”€β”€ Processor_type_2
β”‚   β”‚   └── results.jsonl
β”‚   └── ...
β”œβ”€β”€ merged
β”‚   └── merged_results.jsonl
└── images
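Each results.jsonl and merged_results.jsonl file contains one JSON object per line. A minimal loader, which assumes nothing about the per-sample schema beyond each line being valid JSON:

```python
import json


def load_samples(jsonl_path):
    """Load one sample dict per non-empty line of a JSONL results file."""
    samples = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                samples.append(json.loads(line))
    return samples
```

The exact keys of each sample depend on the MultimodalSample schema described further below; inspect a real merged_results.jsonl to see the fields your version produces.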

πŸš€ Running on distributed nodesΒΆ

A simple bash script is provided to run the process in distributed mode.

bash scripts/process_distributed.sh -f /path/to/my/input/folder

See also Distributed processing.

πŸ“œ ExamplesΒΆ

You can find additional example scripts in the /examples directory.

⚑ Optimization¢

🏎️ Fast mode¢

For some file types, we provide a fast mode that processes files more quickly by using a lighter-weight extraction method. To use it, set use_fast_processors to true in the config file.

Be aware that fast mode may be less accurate than the default mode, especially for scanned (non-native) PDFs, which may require Optical Character Recognition (OCR) for accurate extraction.

πŸ”§ File type parameters tuningΒΆ

Many parameters are hardware-dependent and can be customized to suit your needs.

For example, you can tune:

  • processor batch size

  • dispatcher batch size

  • number of threads per worker

You can configure these parameters by providing a custom config file; see examples/process/config.yaml for an example.
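As a sketch, such a tuning section might look like the fragment below. The key names here are illustrative assumptions, not the real schema; consult examples/process/config.yaml for the actual parameter names.

```yaml
# Hypothetical tuning fragment — key names are illustrative only.
dispatcher:
  batch_size: 64     # files handed out to workers per batch
processors:
  batch_size: 16     # items processed per batch within a worker
  num_threads: 4     # threads per worker
```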

⚠️ Not all parameters are configurable yet.

For distributed execution options, see the Quick Start and Distributed processing.

♻️ Incremental reprocessingΒΆ

The optional top-level previous_results parameter lets you reuse results from a prior run and avoid reprocessing unchanged files, saving time and compute.

previous_results: examples/process/outputs/merged/merged_results.jsonl

Point it to a merged_results.jsonl produced by an earlier run. On the next run, each local input file is compared against that JSONL (URL inputs are always reprocessed):

  • Unchanged files: their previous samples are reused as-is.

  • New or modified files: they are processed normally.

  • Removed files: their samples are dropped from the output.

πŸ“œ More information on what’s under the hoodΒΆ

🚧 Pipeline architecture¢

Our pipeline is a three-step process:

  1. Crawling
    Files and folders are scanned to identify the files to process, while skipping those already processed.

  2. Dispatching
    Files are dispatched to workers in batches. In distributed setups, this stage is also responsible for load balancing across nodes.

  3. Processing
    Workers process files with the appropriate tools for each file type. They extract text, images, audio, and video frames, then pass the results to the next stage.

MMORE uses a common data structure for document samples: MultimodalSample.

The goal is to make it easy to add new processors for new file types, or alternative processing methods for existing ones.
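The real MultimodalSample is defined in the mmore source; as a rough sketch of the idea, where every field name below is an assumption:

```python
from dataclasses import dataclass, field


@dataclass
class MultimodalSample:
    """Illustrative stand-in for mmore's MultimodalSample — check the
    mmore source for the real field names and types."""
    text: str                                        # extracted text
    modalities: list = field(default_factory=list)   # e.g. extracted image references
    metadata: dict = field(default_factory=dict)     # e.g. source file path
```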

πŸ› οΈ Supported file types and toolsΒΆ

The project supports multiple file types and uses various AI-based tools for processing. Below is a table summarizing the supported file types and corresponding tools (N/A means no fast-mode alternative is available):

File Type | Default Mode Tool(s) | Fast Mode Tool(s)
--- | --- | ---
DOCX | python-docx for text and image extraction | N/A
MD | markdown for text extraction, markdownify for HTML conversion | N/A
PPTX | python-pptx for text and image extraction | N/A
XLSX | openpyxl for text and image extraction | N/A
TXT | Python built-in library | N/A
EML | Python built-in library | N/A
MP4, MOV, AVI, MKV, MP3, WAV, AAC | moviepy for video frame extraction; whisper-large-v3-turbo for transcription | whisper-tiny
PDF | marker-pdf for OCR and structured data extraction | PyMuPDF for text and image extraction
HTML | markdownify to convert HTML to MD; requests for images | N/A


MMORE also uses Dask Distributed to manage distributed execution.

πŸ”§ CustomizationΒΆ

The system is designed to be extensible, allowing you to register custom processors for handling new file types or specialized processing. To implement a new processor you need to inherit the Processor class and implement only two methods:

  • accepts: defines which file types your processor supports (e.g. docx)

  • process: how to process a single file (input:file type, output: Multimodal sample, see other processors for reference)

For a minimal example, see TextProcessor.
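As an illustration of the two-method contract, here is a hypothetical processor for .log files. The Processor base class is stubbed so the example runs standalone, and the dict returned by process stands in for a real MultimodalSample; follow TextProcessor in the mmore source for the actual signatures.

```python
from pathlib import Path


class Processor:
    """Stand-in for mmore's Processor base class, stubbed for this sketch."""
    def accepts(self, file_path):
        raise NotImplementedError

    def process(self, file_path):
        raise NotImplementedError


class LogProcessor(Processor):
    """Hypothetical custom processor handling .log files."""

    def accepts(self, file_path):
        # Declare which file types this processor supports.
        return Path(file_path).suffix.lower() == ".log"

    def process(self, file_path):
        # Extract the content of a single file; a dict stands in for
        # the MultimodalSample returned by real processors.
        text = Path(file_path).read_text(encoding="utf-8")
        return {"text": text, "modalities": [], "metadata": {"source": str(file_path)}}
```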

🧹 Post-processing¢

Post-processing refines the extracted text to improve its quality for downstream tasks. The infrastructure is modular and extensible, and mmore ships with several post-processors natively.

Applying the Chunker is strongly recommended: it cuts documents into reasonably sized chunks that are better suited for feeding to an LLM.

The chunker supports a table_handling option to control how markdown tables are split:

Mode | Description
--- | ---
single_row (default) | Each table row gets its own chunk, with the header repeated for context
multi_rows | Rows are grouped to fill the chunk size, with the header repeated per chunk
keep_whole | Tables are never split and are kept as one chunk regardless of size
none | No special table handling; tables are chunked like regular text

You can configure parameters by providing a custom config file. This field is shown in the example config file at examples/process/config.yaml.
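As a sketch, table handling might be set like the fragment below. Only table_handling and its four values come from the documentation above; the surrounding keys are illustrative assumptions, so check examples/process/config.yaml for the real structure.

```yaml
# Hypothetical post-processing fragment — surrounding keys are illustrative.
postprocessors:
  - type: chunker
    args:
      table_handling: single_row   # or multi_rows / keep_whole / none
```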

Once ready, you can run the process using the following command:

python3 -m mmore postprocess --config-file examples/postprocessor/config.yaml --input-data examples/process/outputs/merged/merged_results.jsonl

Use --input-data to specify the path (absolute, or relative to the root of the repository) to the JSONL file produced by the initial processing phase.

♻️ Incremental post-processingΒΆ

Like the processing pipeline, the post-processor accepts an optional previous_results parameter to reuse results from a prior post-processing run and skip unchanged documents.

previous_results: examples/postprocessor/outputs/merged/results.jsonl

New post-processors can easily be implemented, and pipelines can be configured through lightweight YAML files. The post-processing stage produces a new JSONL file containing cleaned and optionally enhanced document samples.

See alsoΒΆ