Reference

mdify.src.parsers

DocumentParser

A class that processes documents (PDFs or images) to extract their content and save artifacts like images, tables, charts, and formulas.

Attributes:

  • save_folder (str) –

    Folder where the processed documents will be saved.

  • extract_metadata (bool) –

    Flag to indicate if metadata should be extracted.

  • metadata_filename (str) –

    The filename for saving metadata.

  • save_artifacts (OutputArtifact) –

    Specifies the types of artifacts to save.

  • debug (bool) –

    Flag to control debug mode. If True, the temporary folders are kept; they contain every page, and every element extracted from each page, as images.

  • PIL_supported_formats (list) –

    List of image formats supported by PIL.

  • detector (LayoutDetector) –

    Layout detection object.

  • extractor (ContentExtractor) –

    Content extraction object.

Methods:

  • parse_directory(documents_dir: str, **kwargs)

    Parses multiple documents from a directory.

  • parse(document: Union[str, bytes], document_name: str = None, document_type: str = None, **kwargs)

    Parses a single document and processes it.

metadata property

Returns:

  • dict

    A dictionary where the keys are the paths to the output files and the values are the metadata objects associated with each document.

metadata_paths property

Returns:

  • dict

    A dictionary with the paths to the output files as keys and the paths to the metadata files as values.

output_files property

Returns:

  • dict

    A dictionary with the paths to the output files as keys and the corresponding file objects as values.

output_files_paths property

Returns:

  • list

    List of file paths to the files generated for each processed document.

cleanup()

Deletes the directory where processed documents and artifacts are stored and all its contents.

parse(document, document_name=None, document_type=None, **kwargs)

Converts a document into Markdown.

Parameters:

  • document (Union[str, bytes]) –

    Document to be parsed. If a string, it is the path to the document. If bytes, document_type and document_name must be specified.

  • document_name (str, default: None ) –

    Name of document if document is an instance of bytes. Defaults to None.

  • document_type (str, default: None ) –

    Type of document if document is an instance of bytes. Defaults to None.

  • **kwargs

    Additional keyword arguments passed to the rendering function and the extractor for each element.

Raises:

  • ValueError

    If document_name and document_type are not provided when parsing bytes.
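
The rule for bytes input can be sketched as a small validation step. This is a minimal illustration of the documented behaviour, not mdify's actual code; `check_parse_arguments` is a hypothetical helper name.

```python
from typing import Optional, Union


def check_parse_arguments(
    document: Union[str, bytes],
    document_name: Optional[str] = None,
    document_type: Optional[str] = None,
) -> None:
    """Hypothetical helper mirroring the documented ValueError rule."""
    # A string is treated as a path, so name and type are not required for it.
    if isinstance(document, bytes) and (
        document_name is None or document_type is None
    ):
        raise ValueError(
            "document_name and document_type must be provided when parsing bytes"
        )
```

Under this sketch, a path string passes with no extra arguments, while raw bytes pass only when both document_name and document_type are given.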

parse_directory(documents_dir, **kwargs)

Parses multiple documents from a given directory.

Parameters:

  • documents_dir (str) –

    Directory containing documents to be parsed.

  • **kwargs

    Additional arguments passed to the parse method.
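
A rough sketch of how a directory can be walked and each file handed to a parse callable, with the extra keyword arguments forwarded unchanged. The function name `parse_directory_sketch`, the `parse` callback parameter, the sorted order, and the choice to skip sub-directories are all assumptions made for illustration; the real method may filter or order files differently.

```python
import os
from typing import Callable, List


def parse_directory_sketch(
    documents_dir: str, parse: Callable[..., object], **kwargs
) -> List[str]:
    """Call `parse` on every regular file in `documents_dir` (illustrative only)."""
    parsed_paths: List[str] = []
    # Sorting makes the processing order deterministic.
    for entry in sorted(os.listdir(documents_dir)):
        path = os.path.join(documents_dir, entry)
        if os.path.isfile(path):  # skip sub-directories (an assumption)
            parse(path, **kwargs)  # extra keyword arguments are forwarded as-is
            parsed_paths.append(path)
    return parsed_paths
```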


mdify.src.layout

LayoutDetector

Detects and organizes layout elements from a document page using a pre-trained YOLO model. This class is designed to process pages containing elements such as titles, headers, footnotes, and pictures, allowing for efficient extraction, sorting, and saving of these components.

Attributes:

  • model

    Pre-trained YOLO model for layout detection.

  • config (dict) –

    Configuration settings for the layout detection model.

  • filename_separator (str) –

    Separator used for naming cropped element files.

  • picture_classifier (PictureClassifier) –

    Classifier for recognizing picture types in the layout.

__custom_sort(elem)

Sorts layout elements based on their vertical and horizontal positions on the page. Elements are ordered first from top-to-bottom (y1) and then from left-to-right (x1).

To support multicolumn layouts, an element's x-coordinate is given more weight the further right the element sits on the page. This way, elements at the bottom of the page can still be positioned correctly.

Each element is also given a priority based on the type of element it is (e.g. footnotes must go at the bottom, headers at the top).

Parameters:

  • elem (tuple) –

    A tuple containing a bounding box, label, and other metadata.

Returns:

  • tuple

    Sorting key based on element type priority and position.
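
One way to picture such a sorting key is shown below. It is a sketch under stated assumptions, not the library's implementation: the label names, the priority values, the `(x1, y1, x2, y2)` bounding-box layout, and the use of a column index as the x-axis weight are all hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple

# Assumed element shape: (bounding box as (x1, y1, x2, y2), label, ...)
Element = Tuple[Sequence[float], str]

# Assumed priorities: lower values sort earlier on the page.
TYPE_PRIORITY: Dict[str, int] = {"header": 0, "footnote": 2}
DEFAULT_PRIORITY = 1


def make_sort_key(
    page_width: float, num_columns: int
) -> Callable[[Element], Tuple[int, int, float, float]]:
    """Build a key that orders by type priority, then column, then y1, then x1."""
    column_width = page_width / num_columns

    def sort_key(elem: Element) -> Tuple[int, int, float, float]:
        (x1, y1, _x2, _y2), label = elem[0], elem[1]
        priority = TYPE_PRIORITY.get(label, DEFAULT_PRIORITY)
        # Elements further to the right fall into a later column bucket.
        column = min(int(x1 // column_width), num_columns - 1)
        return (priority, column, y1, x1)

    return sort_key
```

With `sorted(elements, key=make_sort_key(page_width=600, num_columns=2))`, a header comes first, the left column is read top to bottom before the right column, and footnotes come last.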

__get_num_of_columns(x_positions, page_width, num_bins=10, density_threshold=0.5)

Estimates the number of columns in a document page based on the density of x-coordinates.

Parameters:

  • x_positions (Iterable[float]) –

    X-coordinates of detected bounding boxes.

  • page_width (float) –

    Width of the page in pixels.

  • num_bins (Optional[int], default: 10 ) –

    Number of bins for dividing the page width. Default is 10.

  • density_threshold (Optional[float], default: 0.5 ) –

    Threshold for determining high-density regions. Default is 0.5.

Returns:

  • int

    Number of columns detected in the page.
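
A density-based estimate of this kind can be sketched with a plain histogram. The interpretation below is an assumption: a bin counts as dense when it holds at least `density_threshold` times the count of the fullest bin, and each contiguous run of dense bins is counted as one column. The actual method may define density differently.

```python
from typing import Iterable, List


def estimate_num_columns(
    x_positions: Iterable[float],
    page_width: float,
    num_bins: int = 10,
    density_threshold: float = 0.5,
) -> int:
    """Estimate the column count from where bounding boxes start on the x-axis."""
    counts: List[int] = [0] * num_bins
    bin_width = page_width / num_bins
    for x in x_positions:
        # Clamp so coordinates on the page edge still land in a valid bin.
        index = min(max(int(x // bin_width), 0), num_bins - 1)
        counts[index] += 1

    peak = max(counts)
    if peak == 0:
        return 1  # no boxes at all: assume a single column

    columns = 0
    previous_dense = False
    for count in counts:
        is_dense = count >= density_threshold * peak
        if is_dense and not previous_dense:
            columns += 1  # a new run of dense bins starts here
        previous_dense = is_dense
    return columns
```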

detect(image_path, page_nr, output_dir='', **kwargs)

Detects and extracts layout elements from a document page image.

Parameters:

  • image_path (str) –

    Path to the input image of the document page.

  • page_nr (str) –

    Page number used for naming output files.

  • output_dir (Optional[str], default: '' ) –

    Directory where extracted elements will be saved. Default is an empty string.

  • **kwargs

    Additional parameters for the YOLO model.

Returns:

  • List

    A list of detected layout elements if output_dir is not provided. Otherwise, the elements are saved to the specified directory.

render(image_path, output_dir, **kwargs)

Renders annotated layout elements on the input document page image.

Parameters:

  • image_path (str) –

    Path to the input image of the document page.

  • output_dir (str) –

    Directory where the annotated image will be saved.

  • **kwargs

    Additional parameters for the detection process.

Saves:

  • Annotated image with bounding boxes and labels drawn over the original image.


mdify.src.models

ChartDeplotModel

Bases: HuggingFaceModel

Specialized Hugging Face model for extracting data from charts.

Attributes:

  • placeholder (str) –

    Placeholder token used in the model's output.

  • separator (str) –

    Separator used in the model's output.

FormulaExtractionModel

Bases: HuggingFaceModel

Specialized model for extracting mathematical formulas from images.

HuggingFaceModel

A generic wrapper for Hugging Face models that supports text and vision-based predictions.

Attributes:

  • processor

    Preprocessor for the specific Hugging Face model.

  • model

    Pretrained Hugging Face model for predictions.

  • config (dict) –

    Additional configurations for model generation.

  • prompt (str) –

    Text prompt to be used during prediction (if applicable).

  • use_pixel_values (bool) –

    Whether to use pixel values instead of text tokens.

predict(image_path)

Generates a prediction for the input image using the Hugging Face model.

Parameters:

  • image_path (str) –

    Path to the input image.

Returns:

  • The processed output from the model's predictions.

ImageCaptioningModel

Bases: HuggingFaceModel

Hugging Face model for generating captions for images.

PictureClassifier

Classifies images into categories such as "Picture" or "Chart" using a pretrained Swin Transformer model.

Attributes:

  • model (SwinForImageClassification) –

    Pretrained Swin model for image classification.

  • device (device) –

    Device to perform computations, either CPU or GPU.

  • transform (Compose) –

    Preprocessing transformations for input images.

  • id2class (dict) –

    Mapping of class indices to class names.

classify(image)

Classifies an input image into predefined categories, using GPU if available.

Parameters:

  • image (str | ndarray) –

    Path to the image or an image in the form of a NumPy array.

Returns:

  • int

    Predicted class index of the image.


mdify.src.ocr

OCR

Abstract base class for OCR processing. It provides a common interface and workflow for derived OCR classes.

Methods:

  • process

    Main method to process an image and perform OCR.

  • process_results

    Abstract method to be implemented by subclasses for processing OCR results.

process(image_path, save_dir, filename, save_artifacts=OutputArtifact.NONE, write_mode=WriteMode.EMBEDDED, debug=False, **kwargs)

Processes the image and initializes OCR attributes.

Parameters:

  • image_path (str) –

    Path to the input image.

  • save_dir (str) –

    Directory to save OCR artifacts.

  • filename (str) –

    Base filename for saving artifacts.

  • save_artifacts (Optional[OutputArtifact], default: NONE ) –

    Specifies which artifacts to save.

  • write_mode (Optional[WriteMode], default: EMBEDDED ) –

    Determines the output format of extracted text.

  • debug (Optional[bool], default: False ) –

    If True, saves a debug image.

  • **kwargs

    Additional arguments for customization.

Raises:

  • Exception

    If the image cannot be processed.

process_results(**kwargs) abstractmethod

Abstract method to process OCR results. Subclasses must implement this method.
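
The relationship between process and process_results resembles the template-method pattern. The sketch below shows that pattern with Python's `abc` module; `OCRSketch` and `UppercaseRecognizer` are hypothetical names, and whether the real process returns the result of process_results is an assumption.

```python
from abc import ABC, abstractmethod


class OCRSketch(ABC):
    """Illustrative base class: shared workflow here, specifics in subclasses."""

    def process(self, image_path: str, save_dir: str, filename: str, **kwargs):
        # Store the shared attributes once so every subclass can rely on them.
        self.image_path = image_path
        self.save_dir = save_dir
        self.filename = filename
        # Hand over to the subclass-specific step.
        return self.process_results(**kwargs)

    @abstractmethod
    def process_results(self, **kwargs):
        """Subclasses must implement this method."""


class UppercaseRecognizer(OCRSketch):
    """Toy subclass that stands in for a real recognizer."""

    def process_results(self, **kwargs):
        return f"{self.filename}: {kwargs.get('text', '').upper()}"
```

Instantiating `OCRSketch` directly raises `TypeError`, because the abstract method has no implementation; only concrete subclasses can be created.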

PictureRecognizer

Bases: OCR

Class for recognizing and processing image-based data such as charts, captions, or formulas.

Attributes:

  • chart_qa_model

    Model for extracting data from charts.

  • captioning_model

    Model for generating captions for images.

  • formula_extraction_model

    Model for extracting mathematical formulas.

__make_save_path(extension=IMAGES_SAVE_EXTENSION, **kwargs)

Creates the save path for the image artifact.

__save_image(**kwargs)

Saves the processed image.

process_results(**kwargs)

Processes the image results based on its type (chart, formula, or general image).

Parameters:

  • save_dir (str) –

    Directory to save OCR artifacts.

TableRecognizer

Bases: OCR

Class for recognizing and organizing table data from images.

Attributes:

  • config (dict) –

    Configuration for table OCR model.

  • model (PaddleOCR) –

    Instance of PaddleOCR model.

__find_column_boundaries(bboxes, tolerance=10)

Finds boundaries of table columns based on bounding boxes.

Parameters:

  • bboxes (Iterable[Iterable[float]]) –

    List of bounding boxes.

  • tolerance (int, default: 10 ) –

    Margin of error in pixels for boundary grouping.

Returns:

  • Iterable[float]

    Sorted list of column boundaries.
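
Grouping with a pixel tolerance can be sketched as follows. This assumes each bounding box is `(x1, y1, x2, y2)` and that a column boundary is the left edge of the first box in each group; the function is an illustration rather than the library's algorithm.

```python
from typing import Iterable, List, Sequence


def find_column_boundaries_sketch(
    bboxes: Iterable[Sequence[float]], tolerance: int = 10
) -> List[float]:
    """Collapse left edges that lie within `tolerance` pixels into one boundary."""
    boundaries: List[float] = []
    for x1 in sorted(bbox[0] for bbox in bboxes):
        # Open a new boundary only when this edge is clearly past the last one.
        if not boundaries or x1 - boundaries[-1] > tolerance:
            boundaries.append(x1)
    return boundaries
```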

__find_header(column_boundaries, bboxes, tolerance=10)

Identifies the row index where the table header ends.

Parameters:

  • column_boundaries (Iterable[float]) –

    List of column boundaries.

  • bboxes (ndarray) –

    Array of bounding boxes.

  • tolerance (int, default: 10 ) –

    Margin of error in pixels for header detection.

Returns:

  • int

    Row index where the header ends.

ocr(img, cls=True, **kwargs)

Performs OCR on the provided image and extracts table text and bounding boxes.

Parameters:

  • img (ndarray) –

    Input image as a NumPy array.

  • cls (bool, default: True ) –

    Whether to use classification.

  • **kwargs

    Additional arguments passed to the OCR model.

process_results(**kwargs)

Organizes text into table columns based on bounding box x-coordinates and returns a DataFrame. Ensures every column is populated with either text or None for all rows.

Parameters:

  • row_tolerance (int) –

    Margin of error in pixels to assign elements to the correct row.

  • col_tolerance (int) –

    Margin of error in pixels to assign elements to the correct column.
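
The row and column assignment can be illustrated with a simplified grid builder. It is a sketch only: it returns a list of rows instead of a DataFrame, takes precomputed column boundaries as input, and assumes each cell is an `(x1, y1, text)` tuple. `build_table_sketch` is a hypothetical name.

```python
from typing import List, Optional, Sequence, Tuple

Cell = Tuple[float, float, str]  # assumed shape: (x1, y1, text)


def build_table_sketch(
    cells: Sequence[Cell],
    column_boundaries: Sequence[float],
    row_tolerance: int = 10,
    col_tolerance: int = 10,
) -> List[List[Optional[str]]]:
    """Place each cell in a grid; slots with no text stay None."""
    rows: List[Tuple[float, List[Optional[str]]]] = []
    for x1, y1, text in sorted(cells, key=lambda cell: (cell[1], cell[0])):
        # Start a new row when the cell is too far below the current row.
        if not rows or abs(y1 - rows[-1][0]) > row_tolerance:
            rows.append((y1, [None] * len(column_boundaries)))
        # Assign the text to the first column whose boundary is close enough.
        for index, boundary in enumerate(column_boundaries):
            if abs(x1 - boundary) <= col_tolerance:
                rows[-1][1][index] = text
                break
    return [row for _, row in rows]
```

Every row is created with one slot per column, which is what guarantees that each column holds either text or None for all rows.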

render(img, save_full_path, font_path='../utils/simfang.ttf', **kwargs)

Renders the recognized data onto the input image and saves it.

Parameters:

  • img (ndarray | str) –

    Input image as an array or path.

  • save_full_path (str) –

    Full path to save the rendered image.

  • font_path (str, default: '../utils/simfang.ttf' ) –

    Path to the font used for rendering.

  • **kwargs

    Additional arguments for rendering.

TextRecognizer

Bases: OCR

Class for recognizing text from images using OCR. Two different models are used for headers and paragraphs, because using a single model for both was observed to yield less accurate results.

Attributes:

  • paragraph_model_config (dict) –

    Configuration for paragraph-level OCR.

  • header_reader (Reader) –

    Reader instance for header-level OCR.

  • det_processor, det_model, rec_model, rec_processor –

    Components for OCR processing.

  • default_element_type (str) –

    Default type of element to process ('paragraph').

process_results(**kwargs)

Processes OCR results based on the specified element type (paragraph or header).

Parameters:

  • element_to_process (str) –

    Type of element to process, either 'header' or 'paragraph'.


mdify.src.output

OutputArtifact

Bases: Enum

Enum representing the types of artifacts to be saved.

Types:

  • NONE (int) –

    No artifacts are saved.

  • ONLY_PICTURES (int) –

    Only images are saved.

  • ONLY_TABLES (int) –

    Only tables are saved.

  • ONLY_CHARTS (int) –

    Only charts are saved.

  • ONLY_FORMULAS (int) –

    Only formulas are saved.

  • PICTURES_AND_TABLES (int) –

    Both images and tables are saved.

  • PICTURES_AND_CHARTS (int) –

    Both images and charts are saved.

  • PICTURES_AND_FORMULAS (int) –

    Both images and formulas are saved.

  • TABLES_AND_CHARTS (int) –

    Both tables and charts are saved.

  • TABLES_AND_FORMULAS (int) –

    Both tables and formulas are saved.

  • CHARTS_AND_FORMULAS (int) –

    Both charts and formulas are saved.

  • PICTURES_TABLES_AND_CHARTS (int) –

    Images, tables, and charts are saved.

  • PICTURES_TABLES_AND_FORMULAS (int) –

    Images, tables, and formulas are saved.

  • PICTURES_CHARTS_AND_FORMULAS (int) –

    Images, charts, and formulas are saved.

  • TABLES_CHARTS_AND_FORMULAS (int) –

    Tables, charts, and formulas are saved.

  • ALL (int) –

    All artifacts (images, tables, charts, formulas) are saved.
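
To show how calling code might ask whether a given selection covers a kind of artifact, here is a stand-in enum with the same member names and a hypothetical `covers` helper. The integer values (generated with `auto()`), the class name `ArtifactSelection`, and the helper itself are assumptions; the real OutputArtifact may expose a different way to check this.

```python
from enum import Enum, auto


class ArtifactSelection(Enum):
    """Stand-in with the documented member names; the integer values are assumed."""

    NONE = auto()
    ONLY_PICTURES = auto()
    ONLY_TABLES = auto()
    ONLY_CHARTS = auto()
    ONLY_FORMULAS = auto()
    PICTURES_AND_TABLES = auto()
    PICTURES_AND_CHARTS = auto()
    PICTURES_AND_FORMULAS = auto()
    TABLES_AND_CHARTS = auto()
    TABLES_AND_FORMULAS = auto()
    CHARTS_AND_FORMULAS = auto()
    PICTURES_TABLES_AND_CHARTS = auto()
    PICTURES_TABLES_AND_FORMULAS = auto()
    PICTURES_CHARTS_AND_FORMULAS = auto()
    TABLES_CHARTS_AND_FORMULAS = auto()
    ALL = auto()


KINDS = ("PICTURES", "TABLES", "CHARTS", "FORMULAS")


def covers(selection: ArtifactSelection, kind: str) -> bool:
    """Hypothetical helper: does `selection` include artifacts of type `kind`?"""
    kind = kind.upper()
    if kind not in KINDS:
        raise ValueError(f"Unknown artifact kind: {kind!r}")
    if selection is ArtifactSelection.ALL:
        return True
    # Member names spell out the kinds they include, e.g. TABLES_AND_CHARTS.
    return kind in selection.name.split("_")
```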

OutputWriter

Handles the process of writing content and optionally saving specified types of artifacts.

Methods:

  • write(content: str, save_dir: str, filename: str)

    Writes the content to a file in the specified directory, creating the directory if needed.

write(content, save_dir, filename) staticmethod

Writes the content to a markdown file in the specified directory.

Parameters:

  • content (str) –

    The content to write to the file.

  • save_dir (str) –

    Directory where the file will be saved.

  • filename (str) –

    Name of the file (without extension).
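
A minimal sketch of the documented behaviour, assuming a `.md` extension is appended to filename and UTF-8 encoding is used. `write_markdown` is a hypothetical standalone function, and returning the written path is an addition for illustration only.

```python
import os


def write_markdown(content: str, save_dir: str, filename: str) -> str:
    """Write `content` to `<save_dir>/<filename>.md`, creating the directory if needed."""
    os.makedirs(save_dir, exist_ok=True)  # no error if the directory already exists
    path = os.path.join(save_dir, f"{filename}.md")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(content)
    return path
```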

WriteMode

Bases: Enum

Enum representing the modes for writing content with embedded or referenced artifacts.

Modes:

  • EMBEDDED (int) –

    Artifacts (e.g., images, tables) are directly embedded in the output content.

  • PLACEHOLDER (int) –

    Placeholders for artifacts are added in the output content, requiring external references.

  • DESCRIBED (int) –

    Artifacts are described textually, without direct embedding or placeholders.


mdify.src.extractors

ContentExtractor

Extracts content from a document image based on specified types such as text, tables, pictures, and more. This class delegates extraction tasks to specialized recognizers for text, tables, and pictures.

Attributes:

  • text_recognizer (TextRecognizer) –

    Recognizer for extracting textual elements.

  • table_recognizer (TableRecognizer) –

    Recognizer for extracting table elements.

  • picture_recognizer (PictureRecognizer) –

    Recognizer for extracting picture elements.

extract(image_path, extract_type, **kwargs)

Method factory that extracts content from the specified image based on the type of element requested.

Parameters:

  • image_path (str) –

    Path to the input image of the document.

  • extract_type (str) –

    Type of content to extract (e.g., 'text', 'table', 'picture', etc.).

  • **kwargs

    Additional parameters for the specific extraction method.

Returns:

  • str

    Extracted content as a string.
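
Dispatching on extract_type can be pictured as a lookup table from type names to handler methods. The class below is a sketch with placeholder handlers; the name `ExtractorSketch`, the three supported keys, and raising ValueError for an unknown type are assumptions, not documented behaviour.

```python
from typing import Callable, Dict


class ExtractorSketch:
    """Illustrative dispatcher: one handler per supported element type."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., str]] = {
            "text": self._extract_text,
            "table": self._extract_table,
            "picture": self._extract_picture,
        }

    def extract(self, image_path: str, extract_type: str, **kwargs) -> str:
        handler = self._handlers.get(extract_type)
        if handler is None:
            # Assumed behaviour for unsupported types.
            raise ValueError(f"Unsupported extract_type: {extract_type!r}")
        return handler(image_path, **kwargs)

    # Placeholder handlers; a real extractor would delegate to recognizers here.
    def _extract_text(self, image_path: str, **kwargs) -> str:
        return f"text from {image_path}"

    def _extract_table(self, image_path: str, **kwargs) -> str:
        return f"table from {image_path}"

    def _extract_picture(self, image_path: str, **kwargs) -> str:
        return f"picture from {image_path}"
```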


mdify.src.utils

convert_image_to_pdf(image_path, pdf_path)

Converts an image file to a PDF.

Parameters:

  • image_path (str) –

    Path to the input image file.

  • pdf_path (str) –

    Path to save the output PDF file.

convert_to_jpeg(im)

Converts an image to JPEG format.

Parameters:

  • im (Image) –

    The input image.

Returns:

  • PIL.Image

    The converted image in JPEG format.

get_filename(path, include_extension=False)

Extracts the filename from a given path.

Parameters:

  • path (str) –

    The full path of the file.

  • include_extension (bool, default: False ) –

    Whether to include the file extension. Defaults to False.

Returns:

  • str

    The extracted filename.
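
The behaviour described for get_filename can be approximated with `os.path`. This is a sketch of the documented contract, not the library's source; `get_filename_sketch` is a hypothetical name.

```python
import os


def get_filename_sketch(path: str, include_extension: bool = False) -> str:
    """Return the last path component, with or without its extension."""
    name = os.path.basename(path)
    if include_extension:
        return name
    # splitext removes only the final extension, e.g. ".pdf".
    return os.path.splitext(name)[0]
```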

open_image(img, to_numpy=False)

Opens an image file or array, converting it to JPEG format if necessary.

Parameters:

  • img (str | ndarray) –

    Path to the image file or an image array.

  • to_numpy (bool, default: False ) –

    Whether to return the image as a NumPy array. Defaults to False.

Returns:

  • Union[np.ndarray, PIL.Image]

    The opened image.