Reference

`mdify.src.parsers`

`DocumentParser`

A class that processes documents (PDFs or images) to extract their content and save artifacts like images, tables, charts, and formulas.

Attributes:

save_folder (str) –

Folder where the processed documents will be saved.
extract_metadata (bool) –

Flag to indicate if metadata should be extracted.
metadata_filename (str) –

The filename for saving metadata.
save_artifacts (OutputArtifact) –

Specifies the types of artifacts to save.
debug (bool) –

Flag to control debug mode. If True, the temporary folders containing every page and every element extracted from each page, in image format, are kept.
PIL_supported_formats (list) –

List of image formats supported by PIL.
detector (LayoutDetector) –

Layout detection object.
extractor (ContentExtractor) –

Content extraction object.

Methods:

parse_directory –

`parse_directory(documents_dir: str, **kwargs)`: Parses multiple documents from a directory.
parse –

`parse(document: Union[str, bytes], document_name: str = None, document_type: str = None, **kwargs)`: Parses a single document and processes it.

`metadata` `property`

Returns:

dict –

A dictionary where the keys are the paths to the output files and the values are the metadata objects associated with each document.

`metadata_paths` `property`

Returns:

dict –

A dictionary with the paths to the output files as keys and the paths to the metadata files as values.

`output_files` `property`

Returns:

dict –

A dictionary with the paths to the output files as keys and the corresponding file objects as values.

`output_files_paths` `property`

Returns:

list –

List of file paths to the files generated for each processed document.

`cleanup()`

Deletes the directory where processed documents and artifacts are stored and all its contents.

`parse(document, document_name=None, document_type=None, **kwargs)`

Converts a document into Markdown.

Parameters:

document (Union[str, bytes]) –

Document to be parsed. If a string, it is the path to the document. If bytes, document_type and document_name must be specified.
document_name (str, default: None ) –

Name of the document. Required when document is passed as bytes. Defaults to None.
document_type (str, default: None ) –

Type of the document. Required when document is passed as bytes. Defaults to None.
**kwargs –

Additional keyword arguments passed to the rendering function and the extractor for each element.

Raises:

ValueError –

If document_name and document_type are not provided when parsing bytes.
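The path-versus-bytes rule described above can be illustrated with a small stand-alone check. This is a sketch only, not mdify's implementation: `resolve_document_input` is a hypothetical helper, and deriving the name and type from the file name for string inputs is an assumption.

```python
import os
from typing import Optional, Tuple, Union


def resolve_document_input(
    document: Union[str, bytes],
    document_name: Optional[str] = None,
    document_type: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (name, type) for a document given as a path or as raw bytes."""
    if isinstance(document, bytes):
        # Raw bytes carry no file name, so both values must come from the caller.
        if document_name is None or document_type is None:
            raise ValueError(
                "document_name and document_type are required when parsing bytes"
            )
        return document_name, document_type
    # Assumption: for a path, name and type are derived from the file name.
    stem, ext = os.path.splitext(os.path.basename(document))
    return stem, ext.lstrip(".").lower()
```

With this sketch, a path input needs no extra arguments, while a bytes input without a name and type fails early with `ValueError`.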

`parse_directory(documents_dir, **kwargs)`

Parses multiple documents from a given directory.

Parameters:

documents_dir (str) –

Directory containing documents to be parsed.
**kwargs –

Additional arguments passed to the parse method.

`mdify.src.layout`

`LayoutDetector`

Detects and organizes layout elements from a document page using a pre-trained YOLO model. This class is designed to process pages containing elements such as titles, headers, footnotes, and pictures, allowing for efficient extraction, sorting, and saving of these components.

Attributes:

model –

Pre-trained YOLO model for layout detection.
config (dict) –

Configuration settings for the layout detection model.
filename_separator (str) –

Separator used for naming cropped element files.
picture_classifier (PictureClassifier) –

Classifier for recognizing picture types in the layout.

`__custom_sort(elem)`

Sorts layout elements based on their vertical and horizontal positions on the page. Elements are ordered first from top-to-bottom (y1) and then from left-to-right (x1).

To support multicolumn layouts, elements are given a higher x-axis weight the farther right they sit on the page. This way, elements at the bottom of the page can still be positioned correctly.

Each element is also given a priority based on the type of element it is (e.g. footnotes must go at the bottom, headers at the top).

Parameters:

elem (tuple) –

A tuple containing a bounding box, label, and other metadata.

Returns:

tuple –

Sorting key based on element type priority and position.
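One possible way to build such a sorting key is sketched below. It is not the library's code: the priority table, the `(bbox, label)` element shape, and the use of a column index as the x-axis weight are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Element = Tuple[BBox, str]                # (bounding box, label)

# Hypothetical priorities: lower values sort earlier in the reading order.
TYPE_PRIORITY: Dict[str, int] = {"header": 0, "footnote": 2}
DEFAULT_PRIORITY = 1


def sort_key(elem: Element, page_width: float, num_columns: int = 1):
    """Key made of type priority, column index, then top-to-bottom, left-to-right."""
    (x1, y1, _x2, _y2), label = elem
    priority = TYPE_PRIORITY.get(label, DEFAULT_PRIORITY)
    column_width = page_width / num_columns
    # Elements farther right get a larger column index, i.e. a higher weight.
    column = min(int(x1 // column_width), num_columns - 1)
    return (priority, column, y1, x1)


def order_elements(
    elements: List[Element], page_width: float, num_columns: int = 1
) -> List[Element]:
    """Return the elements in reading order according to sort_key."""
    return sorted(elements, key=lambda e: sort_key(e, page_width, num_columns))
```

In this sketch, an element at the bottom of the left column still sorts before an element at the top of the right column, while headers and footnotes are pinned to the start and end regardless of position.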

`__get_num_of_columns(x_positions, page_width, num_bins=10, density_threshold=0.5)`

Estimates the number of columns in a document page based on the density of x-coordinates.

Parameters:

x_positions (Iterable[float]) –

X-coordinates of detected bounding boxes.
page_width (float) –

Width of the page in pixels.
num_bins (Optional[int], default: 10 ) –

Number of bins for dividing the page width. Default is 10.
density_threshold (Optional[float], default: 0.5 ) –

Threshold for determining high-density regions. Default is 0.5.

Returns:

int –

Number of columns detected in the page.
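The density idea can be sketched with a plain histogram of x-coordinates. The function below is an illustration under stated assumptions, not the method's actual body: in particular, treating a bin as dense when it holds at least `density_threshold` times the peak bin count, and counting runs of dense bins as columns, are guesses at one reasonable interpretation.

```python
from typing import Iterable


def estimate_num_columns(
    x_positions: Iterable[float],
    page_width: float,
    num_bins: int = 10,
    density_threshold: float = 0.5,
) -> int:
    """Estimate the column count from where bounding boxes start on the x-axis."""
    counts = [0] * num_bins
    for x in x_positions:
        # Clamp the index so that x == page_width falls into the last bin.
        index = min(int(x / page_width * num_bins), num_bins - 1)
        counts[max(index, 0)] += 1
    peak = max(counts)
    if peak == 0:
        return 1  # No boxes at all: fall back to a single column.
    # Assumption: a bin is dense when it reaches density_threshold * peak.
    dense = [count >= density_threshold * peak for count in counts]
    # Each run of consecutive dense bins is counted as one column.
    runs = sum(
        1 for i, is_dense in enumerate(dense) if is_dense and (i == 0 or not dense[i - 1])
    )
    return max(runs, 1)
```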

`detect(image_path, page_nr, output_dir='', **kwargs)`

Detects and extracts layout elements from a document page image.

Parameters:

image_path (str) –

Path to the input image of the document page.
page_nr (str) –

Page number used for naming output files.
output_dir (Optional[str], default: '' ) –

Directory where extracted elements will be saved. Default is an empty string.
**kwargs –

Additional parameters for the YOLO model.

Returns:

List –

A list of detected layout elements if output_dir is not provided. Otherwise, saves elements to the specified directory.

`render(image_path, output_dir, **kwargs)`

Renders annotated layout elements on the input document page image.

Parameters:

image_path (str) –

Path to the input image of the document page.
output_dir (str) –

Directory where the annotated image will be saved.
**kwargs –

Additional parameters for the detection process.

Saves

Annotated image with bounding boxes and labels drawn over the original image.

`mdify.src.models`

`ChartDeplotModel`

Bases: HuggingFaceModel

Specialized Hugging Face model for extracting data from charts.

Attributes:

placeholder (str) –

Placeholder token used in the model's output.
separator (str) –

Separator used in the model's output.

`FormulaExtractionModel`

Bases: HuggingFaceModel

Specialized model for extracting mathematical formulas from images.

`HuggingFaceModel`

A generic wrapper for Hugging Face models that supports text and vision-based predictions.

Attributes:

processor –

Preprocessor for the specific Hugging Face model.
model –

Pretrained Hugging Face model for predictions.
config (dict) –

Additional configurations for model generation.
prompt (str) –

Text prompt to be used during prediction (if applicable).
use_pixel_values (bool) –

Whether to use pixel values instead of text tokens.

`predict(image_path)`

Generates a prediction for the input image using the Hugging Face model.

Parameters:

image_path (str) –

Path to the input image.

Returns:

–

The processed output from the model's predictions.

`ImageCaptioningModel`

Bases: HuggingFaceModel

Hugging Face model for generating captions for images.

`PictureClassifier`

Classifies images into categories such as "Picture" or "Chart" using a pretrained Swin Transformer model.

Attributes:

model (SwinForImageClassification) –

Pretrained Swin model for image classification.
device (device) –

Device to perform computations, either CPU or GPU.
transform (Compose) –

Preprocessing transformations for input images.
id2class (dict) –

Mapping of class indices to class names.

`classify(image)`

Classifies an input image into predefined categories, using GPU if available.

Parameters:

image (str | ndarray) –

Path to the image or an image in the form of a NumPy array.

Returns:

int –

Predicted class index of the image.

`mdify.src.ocr`

`OCR`

Abstract base class for OCR processing. It provides a common interface and workflow for derived OCR classes.

Methods:

process –

Main method to process an image and perform OCR.
process_results –

Abstract method to be implemented by subclasses for processing OCR results.

`process(image_path, save_dir, filename, save_artifacts=OutputArtifact.NONE, write_mode=WriteMode.EMBEDDED, debug=False, **kwargs)`

Processes the image and initializes OCR attributes.

Parameters:

image_path (str) –

Path to the input image.
save_dir (str) –

Directory to save OCR artifacts.
filename (str) –

Base filename for saving artifacts.
save_artifacts (Optional[OutputArtifact], default: NONE ) –

Specifies which artifacts to save.
write_mode (Optional[WriteMode], default: EMBEDDED ) –

Determines the output format of extracted text.
debug (Optional[bool], default: False ) –

If True, saves a debug image.
**kwargs –

Additional arguments for customization.

Raises:

Exception –

If the image cannot be processed.

`process_results(**kwargs)` `abstractmethod`

Abstract method to process OCR results. Subclasses must implement this method.
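The relationship between `process` and `process_results` follows the usual abstract-base-class pattern: the base class owns the shared workflow and subclasses supply the result handling. The sketch below shows that pattern with hypothetical classes (`BaseRecognizer`, `UppercaseRecognizer`); it does not reproduce the real `OCR` class or its arguments.

```python
from abc import ABC, abstractmethod


class BaseRecognizer(ABC):
    """Hypothetical stand-in for an OCR-style base class."""

    def process(self, image_path: str, **kwargs) -> str:
        # Shared workflow: record the input, then defer to the subclass.
        self.image_path = image_path
        return self.process_results(**kwargs)

    @abstractmethod
    def process_results(self, **kwargs) -> str:
        """Turn raw recognition output into the final content."""


class UppercaseRecognizer(BaseRecognizer):
    """Toy subclass: 'recognizes' text by upper-casing a supplied string."""

    def process_results(self, **kwargs) -> str:
        return str(kwargs.get("text", "")).upper()
```

Instantiating the base class directly raises `TypeError`, which is how Python enforces that subclasses implement the abstract method.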

`PictureRecognizer`

Bases: OCR

Class for recognizing and processing image-based data such as charts, captions, or formulas.

Attributes:

chart_qa_model –

Model for extracting data from charts.
captioning_model –

Model for generating captions for images.
formula_extraction_model –

Model for extracting mathematical formulas.

`__make_save_path(extension=IMAGES_SAVE_EXTENSION, **kwargs)`

Creates the save path for the image artifact.

`__save_image(**kwargs)`

Saves the processed image.

`process_results(**kwargs)`

Processes the image results based on its type (chart, formula, or general image).

Parameters:

save_dir (str) –

Directory to save OCR artifacts.

`TableRecognizer`

Bases: OCR

Class for recognizing and organizing table data from images.

Attributes:

config (dict) –

Configuration for table OCR model.
model (PaddleOCR) –

Instance of PaddleOCR model.

`__find_column_boundaries(bboxes, tolerance=10)`

Finds boundaries of table columns based on bounding boxes.

Parameters:

bboxes (Iterable[Iterable[float]]) –

List of bounding boxes.
tolerance (int, default: 10 ) –

Margin of error in pixels for boundary grouping.

Returns:

Iterable[float] –

Sorted list of column boundaries.
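A tolerance-based grouping of this kind can be sketched as follows. The code is illustrative only: it assumes boxes are given as `(x1, y1, x2, y2)` and that a column boundary is the left edge of the first box in each group, neither of which is confirmed by the reference.

```python
from typing import Iterable, List


def find_column_boundaries(
    bboxes: Iterable[Iterable[float]], tolerance: int = 10
) -> List[float]:
    """Group the left edges (x1) of bounding boxes into column boundaries."""
    left_edges = sorted(list(box)[0] for box in bboxes)
    boundaries: List[float] = []
    for x in left_edges:
        # Start a new column only when x lies farther than `tolerance`
        # from the most recent boundary.
        if not boundaries or x - boundaries[-1] > tolerance:
            boundaries.append(x)
    return boundaries
```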

`__find_header(column_boundaries, bboxes, tolerance=10)`

Identifies the row index where the table header ends.

Parameters:

column_boundaries (Iterable[float]) –

List of column boundaries.
bboxes (ndarray) –

Array of bounding boxes.
tolerance (int, default: 10 ) –

Margin of error in pixels for header detection.

Returns:

int –

Row index where the header ends.

`ocr(img, cls=True, **kwargs)`

Performs OCR on the provided image and extracts table text and bounding boxes.

Parameters:

img (ndarray) –

Input image as a NumPy array.
cls (bool, default: True ) –

Whether to use classification.
**kwargs –

Additional arguments passed to the OCR model.

`process_results(**kwargs)`

Organizes text into table columns based on bounding box x-coordinates and returns a DataFrame. Ensures every column is populated with either text or None for all rows.

Parameters:

row_tolerance (int) –

Margin of error in pixels to assign elements to the correct row.
col_tolerance (int) –

Margin of error in pixels to assign elements to the correct column.
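To make the role of the two tolerances concrete, the sketch below arranges recognized cells into a grid and fills missing cells with None. It is a simplified illustration, not the method's implementation: it returns a list of lists rather than a DataFrame, and the `(x1, y1, text)` cell shape and the `build_table` name are assumptions.

```python
from typing import List, Optional, Sequence, Tuple

Cell = Tuple[float, float, str]  # (x1, y1, text)


def build_table(
    cells: Sequence[Cell],
    column_boundaries: Sequence[float],
    row_tolerance: int = 10,
    col_tolerance: int = 10,
) -> List[List[Optional[str]]]:
    """Place each cell into a row/column grid, padding gaps with None."""
    row_starts: List[float] = []
    grid: List[List[Optional[str]]] = []
    for x1, y1, text in sorted(cells, key=lambda c: (c[1], c[0])):
        # Open a new row when y1 is farther than row_tolerance from the current row.
        if not row_starts or abs(y1 - row_starts[-1]) > row_tolerance:
            row_starts.append(y1)
            grid.append([None] * len(column_boundaries))
        # Use the first column boundary that lies within col_tolerance of x1.
        for col, boundary in enumerate(column_boundaries):
            if abs(x1 - boundary) <= col_tolerance:
                grid[-1][col] = text
                break
    return grid
```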

`render(img, save_full_path, font_path='../utils/simfang.ttf', **kwargs)`

Renders the recognized data onto the input image and saves it.

Parameters:

img (ndarray | str) –

Input image as an array or path.
save_full_path (str) –

Full path to save the rendered image.
font_path (str, default: '../utils/simfang.ttf' ) –

Path to the font used for rendering.
**kwargs –

Additional arguments for rendering.

`TextRecognizer`

Bases: OCR

Class for recognizing text from images using OCR. Two different models are used for headers and paragraphs as it has been observed that using only one of them for both yields less accurate results.

Attributes:

paragraph_model_config (dict) –

Configuration for paragraph-level OCR.
header_reader (Reader) –

Reader instance for header-level OCR.
det_processor, det_model, rec_model, rec_processor –

Components for OCR processing.
default_element_type (str) –

Default type of element to process ('paragraph').

`process_results(**kwargs)`

Processes OCR results based on the specified element type (paragraph or header).

Parameters:

element_to_process (str) –

Either 'header' or 'paragraph'.

`mdify.src.output`

`OutputArtifact`

Bases: Enum

Enum representing the types of artifacts to be saved.

Types

NONE (int) –

No artifacts are saved.
ONLY_PICTURES (int) –

Only images are saved.
ONLY_TABLES (int) –

Only tables are saved.
ONLY_CHARTS (int) –

Only charts are saved.
ONLY_FORMULAS (int) –

Only formulas are saved.
PICTURES_AND_TABLES (int) –

Both images and tables are saved.
PICTURES_AND_CHARTS (int) –

Both images and charts are saved.
PICTURES_AND_FORMULAS (int) –

Both images and formulas are saved.
TABLES_AND_CHARTS (int) –

Both tables and charts are saved.
TABLES_AND_FORMULAS (int) –

Both tables and formulas are saved.
CHARTS_AND_FORMULAS (int) –

Both charts and formulas are saved.
PICTURES_TABLES_AND_CHARTS (int) –

Images, tables, and charts are saved.
PICTURES_TABLES_AND_FORMULAS (int) –

Images, tables, and formulas are saved.
PICTURES_CHARTS_AND_FORMULAS (int) –

Images, charts, and formulas are saved.
TABLES_CHARTS_AND_FORMULAS (int) –

Tables, charts, and formulas are saved.
ALL (int) –

All artifacts (images, tables, charts, formulas) are saved.

`OutputWriter`

Handles the process of writing content and optionally saving specified types of artifacts.

Methods:

write –

`write(content: str, save_dir: str, filename: str)`: Writes the content to a file in the specified directory, creating the directory if needed.

`write(content, save_dir, filename)` `staticmethod`

Writes the content to a markdown file in the specified directory.

Parameters:

content (str) –

The content to write to the file.
save_dir (str) –

Directory where the file will be saved.
filename (str) –

Name of the file (without extension).
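The behavior described for `write` (create the directory if needed, then write a Markdown file named after `filename`) can be sketched with the standard library alone. The function name `write_markdown`, the `.md` extension, the UTF-8 encoding, and the returned path are assumptions of this sketch, not documented behavior.

```python
import os


def write_markdown(content: str, save_dir: str, filename: str) -> str:
    """Write content to <save_dir>/<filename>.md and return the file path."""
    # Create the target directory (and any parents) if it does not exist yet.
    os.makedirs(save_dir, exist_ok=True)
    path = os.path.join(save_dir, f"{filename}.md")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(content)
    return path
```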

`WriteMode`

Bases: Enum

Enum representing the modes for writing content with embedded or referenced artifacts.

Modes

EMBEDDED (int) –

Artifacts (e.g., images, tables) are directly embedded in the output content.
PLACEHOLDER (int) –

Placeholders for artifacts are added in the output content, requiring external references.
DESCRIBED (int) –

Artifacts are described textually, without direct embedding or placeholders.

`mdify.src.extractors`

`ContentExtractor`

Extracts content from a document image based on specified types such as text, tables, pictures, and more. This class delegates extraction tasks to specialized recognizers for text, tables, and pictures.

Attributes:

text_recognizer (TextRecognizer) –

Recognizer for extracting textual elements.
table_recognizer (TableRecognizer) –

Recognizer for extracting table elements.
picture_recognizer (PictureRecognizer) –

Recognizer for extracting picture elements.

`extract(image_path, extract_type, **kwargs)`

Method factory that extracts content from the specified image based on the type of element requested.

Parameters:

image_path (str) –

Path to the input image of the document.
extract_type (str) –

Type of content to extract (e.g., 'text', 'table', 'picture', etc.).
**kwargs –

Additional parameters for the specific extraction method.

Returns:

str –

Extracted content as a string.
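A method factory of this kind is commonly written as a lookup from the type name to a handler. The sketch below uses a hypothetical `MiniExtractor` whose handlers return placeholder strings; the real class delegates to its recognizers, and raising `ValueError` for unknown types is an assumption.

```python
from typing import Callable, Dict


class MiniExtractor:
    """Hypothetical dispatcher; the handlers only return placeholder strings."""

    def __init__(self) -> None:
        # Map each supported type name to the method that handles it.
        self._handlers: Dict[str, Callable[..., str]] = {
            "text": self._extract_text,
            "table": self._extract_table,
            "picture": self._extract_picture,
        }

    def extract(self, image_path: str, extract_type: str, **kwargs) -> str:
        try:
            handler = self._handlers[extract_type]
        except KeyError:
            raise ValueError(f"Unsupported extract_type: {extract_type!r}") from None
        return handler(image_path, **kwargs)

    def _extract_text(self, image_path: str, **kwargs) -> str:
        return f"text from {image_path}"

    def _extract_table(self, image_path: str, **kwargs) -> str:
        return f"table from {image_path}"

    def _extract_picture(self, image_path: str, **kwargs) -> str:
        return f"picture from {image_path}"
```

Keeping the mapping in one dictionary means a new element type only needs one new handler and one new entry.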

`mdify.src.utils`

`convert_image_to_pdf(image_path, pdf_path)`

Converts an image file to a PDF.

Parameters:

image_path (str) –

Path to the input image file.
pdf_path (str) –

Path to save the output PDF file.

`convert_to_jpeg(im)`

Converts an image to JPEG format.

Parameters:

im (Image) –

The input image.

Returns:

PIL.Image –

The converted image in JPEG format.

`get_filename(path, include_extension=False)`

Extracts the filename from a given path.

Parameters:

path (str) –

The full path of the file.
include_extension (bool, default: False ) –

Whether to include the file extension. Defaults to False.

Returns:

str –

The extracted filename.
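The effect of `include_extension` can be shown with an equivalent built on `os.path`. The helper `filename_from_path` below is a stand-in written for this example and may differ from the library function in edge cases such as multi-part extensions.

```python
import os


def filename_from_path(path: str, include_extension: bool = False) -> str:
    """Return the last path component, with or without its extension."""
    name = os.path.basename(path)
    if include_extension:
        return name
    # splitext removes only the final extension, e.g. ".gz" from "a.tar.gz".
    return os.path.splitext(name)[0]
```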

`open_image(img, to_numpy=False)`

Opens an image file or array, converting it to JPEG format if necessary.

Parameters:

img (str | ndarray) –

Path to the image file or an image array.
to_numpy (bool, default: False ) –

Whether to return the image as a NumPy array. Defaults to False.

Returns:

Union[np.ndarray, PIL.Image] –

The opened image.
