Loaders — cognity-ai API

BaseLoader

All loaders implement the BaseLoader abstract base class. You can subclass it to add support for custom file formats.

Python

from abc import ABC, abstractmethod
from cognity_ai.models import Document

class BaseLoader(ABC):

    @abstractmethod
    def load(self, path: str) -> list[Document]: ...

    @property
    @abstractmethod
    def supported_extensions(self) -> list[str]: ...

load(path) abstract method

Load and parse a file at path, returning a list of Document objects. Multi-page formats like PDF return one Document per file (not per page) with a populated page_map.

path str Absolute or relative path to the source file.

Returns list[Document] — typically one element; some loaders (e.g. ExcelLoader) may return one per sheet.

supported_extensions abstract property

Returns the list of file extensions this loader handles, each including the leading dot (e.g. [".pdf"]). Used by LoaderFactory to auto-select the correct loader.

Returns list[str]
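As an illustration of what the abstract contract implies, the sketch below re-declares a stand-in BaseLoader and Document so that it runs without cognity-ai installed; the real classes live in the library. IncompleteLoader, PlainTextLoader, and can_instantiate are hypothetical names used only here. It shows that Python refuses to instantiate a subclass that omits either abstract member.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in with only the fields this sketch needs."""
    text: str
    source_path: str
    metadata: dict = field(default_factory=dict)


class BaseLoader(ABC):
    """Stand-in mirroring the interface shown above."""

    @abstractmethod
    def load(self, path: str) -> list[Document]: ...

    @property
    @abstractmethod
    def supported_extensions(self) -> list[str]: ...


class IncompleteLoader(BaseLoader):
    """Hypothetical loader that forgets supported_extensions."""

    def load(self, path: str) -> list[Document]:
        return []


class PlainTextLoader(BaseLoader):
    """Hypothetical loader that implements both abstract members."""

    @property
    def supported_extensions(self) -> list[str]:
        return [".txt"]

    def load(self, path: str) -> list[Document]:
        with open(path, encoding="utf-8") as fh:
            return [Document(text=fh.read(), source_path=path)]


def can_instantiate(cls) -> bool:
    """Return True if cls() succeeds, False if abstract members are missing."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

Calling can_instantiate(IncompleteLoader) returns False because the ABC machinery raises TypeError at construction time, which is usually the first symptom of a missing supported_extensions property.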

Document Model

The universal document representation returned by all loaders. Downstream pipeline stages (chunkers, embedders, extractors) operate on this type.

Python

from dataclasses import dataclass

from cognity_ai.models import Document, PageInfo, ImageRef

@dataclass
class Document:
    id:          str             # UUID — auto-generated on load
    text:        str             # Full extracted plain text
    metadata:    dict            # source, title, page_count, author, …
    source_path: str             # Absolute path to the original file
    page_map:    list[PageInfo]  # Per-page character offsets + headings
    image_refs:  list[ImageRef]  # Embedded image positions in the text

@dataclass
class PageInfo:
    page_number: int
    char_start:  int
    char_end:    int
    headings:    list[str]

@dataclass
class ImageRef:
    image_id:    str    # Unique ID within the document
    char_offset: int    # Position in text where the image is anchored
    page_number: int
    caption:     str    # OCR-extracted or explicit caption

Field	Type	Description
`id`	`str`	UUID generated at load time. Stable for the same file path.
`text`	`str`	Full extracted plain text, normalised to UTF-8 with page boundaries marked by `\n\n`.
`metadata`	`dict`	Loader-specific metadata. Common keys: `source`, `title`, `author`, `page_count`, `created`, `encoding`.
`source_path`	`str`	Absolute path to the original file on disk.
`page_map`	`list[PageInfo]`	One entry per page. Maps page numbers to character offsets in `text`.
`image_refs`	`list[ImageRef]`	References to embedded images with their text anchor positions and captions.
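The character offsets in page_map can be used to recover the text of a single page, or to find the page a given offset belongs to. The following is a minimal sketch under stated assumptions: PageInfo is a local stand-in for the library class, char_end is assumed to be exclusive (like a Python slice bound), and page_text, page_for_offset, and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageInfo:
    """Local stand-in with the same fields as the PageInfo shown above."""
    page_number: int
    char_start: int
    char_end: int  # assumed exclusive
    headings: list[str] = field(default_factory=list)


def page_text(text: str, page_map: list[PageInfo], page_number: int) -> str:
    """Return the slice of text that belongs to one page."""
    for page in page_map:
        if page.page_number == page_number:
            return text[page.char_start:page.char_end]
    raise KeyError(f"page {page_number} not in page_map")


def page_for_offset(page_map: list[PageInfo], offset: int) -> Optional[int]:
    """Return the page number containing offset, or None for separator gaps."""
    for page in page_map:
        if page.char_start <= offset < page.char_end:
            return page.page_number
    return None


# Two pages joined by the "\n\n" page-boundary marker.
SAMPLE_TEXT = "First page.\n\nSecond page."
SAMPLE_PAGE_MAP = [
    PageInfo(page_number=1, char_start=0, char_end=11),
    PageInfo(page_number=2, char_start=13, char_end=25),
]
```

With the sample data, page_text(SAMPLE_TEXT, SAMPLE_PAGE_MAP, 2) yields "Second page.", and an offset that falls on the separator maps to no page.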

Supported Formats

cognity-ai ships loaders for 11 file format families. All are auto-selected by LoaderFactory based on file extension.

Loader Class	Extensions	Key Dependency	Metadata Extracted
`TxtLoader`	.txt	stdlib	filename, encoding, line count
`MdLoader`	.md .markdown	stdlib	H1–H6 headings as sections
`PdfLoader`	.pdf	pdfplumber, pypdf	page numbers, headings, tables, embedded images
`DocxLoader`	.docx .doc	python-docx	headings, tables, page breaks, embedded images
`ExcelLoader`	.xlsx .xls	openpyxl, pandas	sheet names, row-to-text conversion, formula values
`CsvLoader`	.csv .tsv	pandas	headers, delimiter auto-detect
`PptxLoader`	.pptx .ppt	python-pptx	slide numbers, speaker notes, embedded images
`HtmlLoader`	.html .htm	beautifulsoup4	title, headings, links
`JsonLoader`	.json	stdlib	key structure, depth
`YamlLoader`	.yaml .yml	PyYAML	key structure
`ImageLoader`	.jpg .jpeg .png .bmp .tiff .webp .gif	OCR subsystem	filename, OCR confidence score

Note: ImageLoader delegates to the OCR backend configured in LibraryConfig.ocr. Set ocr to one of "gemini_vision", "aws_textract", "google_vision", or "tesseract", depending on your deployment. For scanned PDFs, PdfLoader automatically invokes the same OCR backend when native text extraction yields empty pages.
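The fallback behaviour described in the note can be pictured roughly as follows. This is only an illustrative sketch, not PdfLoader's actual implementation: fill_empty_pages and the ocr_page callback are hypothetical, and "empty" is assumed to mean whitespace-only text.

```python
from typing import Callable


def fill_empty_pages(
    native_pages: list[str],
    ocr_page: Callable[[int], str],
) -> list[str]:
    """Return per-page text, calling ocr_page(index) only for empty pages.

    native_pages holds the text a native extractor produced for each page;
    ocr_page is a hypothetical callback standing in for the OCR backend.
    """
    result = []
    for index, text in enumerate(native_pages):
        if text.strip():
            # Native extraction produced text, so keep it as-is.
            result.append(text)
        else:
            # Whitespace-only page: fall back to OCR for this page only.
            result.append(ocr_page(index))
    return result
```

Running OCR only on the pages that need it keeps mixed documents (some native pages, some scans) cheap to process.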

LoaderFactory

The recommended way to get a loader. LoaderFactory maintains an extension-to-loader registry and auto-selects the correct loader for any given file path.

Python

from cognity_ai.loaders import LoaderFactory

factory = LoaderFactory()

# Auto-detect from extension — returns the appropriate BaseLoader subclass
loader = factory.get_loader("report.pdf")
docs = loader.load("report.pdf")

# Load a directory — recurses and dispatches per-file
all_docs = factory.load_directory("./documents/")

# Register a custom loader for a proprietary format
factory.register(".myext", MyCustomLoader)

get_loader(path) method

Look up and return the loader registered for the extension of path.

path str File path or filename. Only the extension is used for lookup.

Returns BaseLoader

Raises UnsupportedFormatError if no loader is registered for the extension.

load_directory(path, recursive=True, glob="**/*") method

Walk a directory and load all files whose extensions are registered. Unknown extensions are silently skipped.

path str Root directory to scan.
recursive bool Whether to descend into subdirectories. Default True.
glob str Glob pattern to filter files. Default "**/*".

Returns list[Document]
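To make the file-selection step concrete, here is a sketch of how a directory walk with an extension filter might look using only pathlib. It is not the library's code: candidate_files is a hypothetical helper, extension matching is assumed to be case-insensitive, and with recursive=False it is assumed that only the top level is scanned and the last segment of the glob pattern is matched against file names.

```python
from fnmatch import fnmatch
from pathlib import Path


def candidate_files(root: str, extensions: set[str],
                    recursive: bool = True, glob: str = "**/*") -> list[Path]:
    """List files under root whose suffix is in extensions.

    Files with other extensions are skipped silently, as described above.
    """
    base = Path(root)
    if recursive:
        paths = base.glob(glob)
    else:
        # Top level only: match file names against the last pattern segment.
        name_pattern = glob.rsplit("/", 1)[-1]
        paths = (p for p in base.iterdir() if fnmatch(p.name, name_pattern))
    return sorted(
        p for p in paths
        if p.is_file() and p.suffix.lower() in extensions
    )
```

A call such as candidate_files("./documents/", {".pdf", ".txt"}) returns the sorted paths that a loader dispatch step could then iterate over.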

register(extension, loader_class) method

Register a loader class for a file extension.

extension str File extension including the leading dot, e.g. ".xyz".
loader_class type[BaseLoader] The loader class (not an instance) to register.
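An extension-to-loader registry of the kind described in this section could be sketched as below. ExtensionRegistry is a hypothetical class, not LoaderFactory itself; UnsupportedFormatError is re-declared as a stand-in, and both the case-insensitive lookup and returning a fresh instance per call are assumptions.

```python
from pathlib import Path


class UnsupportedFormatError(Exception):
    """Stand-in for the library's exception of the same name."""


class ExtensionRegistry:
    """Hypothetical, minimal extension-to-loader-class mapping."""

    def __init__(self) -> None:
        self._loaders: dict[str, type] = {}

    def register(self, extension: str, loader_class: type) -> None:
        """Map an extension (with leading dot) to a loader class."""
        if not extension.startswith("."):
            raise ValueError("extension must include the leading dot")
        self._loaders[extension.lower()] = loader_class

    def get_loader(self, path: str):
        """Instantiate the loader registered for the extension of path."""
        extension = Path(path).suffix.lower()
        try:
            loader_class = self._loaders[extension]
        except KeyError:
            raise UnsupportedFormatError(extension) from None
        return loader_class()  # a fresh instance per lookup
```

Registering the class rather than an instance lets the registry decide when to construct loaders, which matches the loader_class parameter described above.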

Direct Loader Usage

You can also instantiate loaders directly, which is useful when you need access to loader-specific options:

Python

from cognity_ai.loaders import PdfLoader, DocxLoader, CsvLoader

# PDF — extract text and images
pdf_loader = PdfLoader(extract_images=True, ocr_fallback=True)
docs = pdf_loader.load("annual_report.pdf")
doc = docs[0]
print(doc.metadata["page_count"])   # → 42
print(len(doc.page_map))            # → 42
print(len(doc.image_refs))          # → number of embedded images

# DOCX — headings and tables
docx_loader = DocxLoader()
docs = docx_loader.load("spec.docx")

# CSV — custom delimiter
csv_loader = CsvLoader(delimiter=";")
docs = csv_loader.load("data.csv")
print(docs[0].metadata["headers"])   # → ["col1", "col2", …]

PDF Utilities

Lower-level helpers in cognity_ai.loaders.pdf_utils for programmatic PDF manipulation independent of the loader pipeline.

Python

from cognity_ai.loaders.pdf_utils import (
    extract_tables,   # → list[pd.DataFrame], one per extracted table
    extract_images,   # → list[bytes], raw bytes of each embedded image
    extract_metadata, # → dict  (author, title, created, page_count, …)
    slice_pages,      # → bytes  (sub-PDF from page range)
    merge_pdfs,       # → bytes  (merged single PDF)
    pdf_to_images,    # → list[PIL.Image]  (full-page raster for OCR)
)

Function	Signature	Returns	Description
`extract_tables`	`(path, pages=None)`	`list[DataFrame]`	Extract all tables from the PDF as pandas DataFrames. Pass `pages=[1,3]` to limit to specific pages.
`extract_images`	`(path, pages=None)`	`list[bytes]`	Return raw image bytes for every embedded image. Includes JPEG, PNG, and JBIG2 streams.
`extract_metadata`	`(path)`	`dict`	Return document-level metadata: `author`, `title`, `subject`, `creator`, `created`, `page_count`.
`slice_pages`	`(path, start, end)`	`bytes`	Extract pages `start`–`end` (1-indexed, inclusive) as a new in-memory PDF. Write to disk with `open(..., "wb").write(result)`.
`merge_pdfs`	`(paths)`	`bytes`	Concatenate multiple PDFs in order. Accepts a list of file paths.
`pdf_to_images`	`(path, dpi=150)`	`list[PIL.Image]`	Render each page as a PIL Image at the specified DPI. Useful for full-page OCR pipelines.
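Because slice_pages takes a 1-indexed, inclusive range while Python sequences are 0-indexed, off-by-one errors are easy to make when combining it with other code. The hypothetical helper below (not part of pdf_utils) shows the arithmetic.

```python
def zero_based_indices(start: int, end: int) -> range:
    """Convert a 1-indexed, inclusive page range to 0-based indices."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Page `start` is index start - 1; `end` is inclusive, so the
    # exclusive upper bound of the range is simply `end`.
    return range(start - 1, end)
```

For pages 40 to 55 this gives indices 39 through 54, that is 55 - 40 + 1 = 16 pages.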

PDF Utilities Example

Python

from cognity_ai.loaders.pdf_utils import extract_tables, slice_pages, pdf_to_images

# Extract all tables from a financial report
tables = extract_tables("report.pdf")
for df in tables:
    print(df.to_string())

# Slice out the appendix (pages 40–55) as a new PDF
appendix_bytes = slice_pages("report.pdf", start=40, end=55)
with open("appendix.pdf", "wb") as f:
    f.write(appendix_bytes)

# Render pages as images for vision model processing
images = pdf_to_images("scanned_doc.pdf", dpi=300)
for i, img in enumerate(images):
    img.save(f"page_{i+1}.png")

Writing a Custom Loader

Subclass BaseLoader, implement the two abstract members, and register with LoaderFactory:

Python

from cognity_ai.loaders import BaseLoader, LoaderFactory
from cognity_ai.models import Document
import uuid

class EpubLoader(BaseLoader):

    @property
    def supported_extensions(self) -> list[str]:
        return [".epub"]

    def load(self, path: str) -> list[Document]:
        # _parse_epub is a placeholder for your own EPUB parsing routine;
        # it is expected to return the book's plain text as a str.
        text = _parse_epub(path)
        return [
            Document(
                id=str(uuid.uuid4()),
                text=text,
                metadata={"source": path},
                source_path=path,
                page_map=[],
                image_refs=[],
            )
        ]

# Register globally
factory = LoaderFactory()
factory.register(".epub", EpubLoader)
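A custom loader that knows its page boundaries may want to populate page_map instead of leaving it empty. The sketch below builds the combined text and matching offsets from a list of per-page strings. It assumes pages are joined with "\n\n" (as in the text field description), that page numbers start at 1, and that char_end is exclusive; PageInfo is a local stand-in and join_pages is a hypothetical helper.

```python
from dataclasses import dataclass, field

PAGE_SEPARATOR = "\n\n"


@dataclass
class PageInfo:
    """Local stand-in with the same fields as the library's PageInfo."""
    page_number: int
    char_start: int
    char_end: int  # assumed exclusive
    headings: list[str] = field(default_factory=list)


def join_pages(pages: list[str]) -> tuple[str, list[PageInfo]]:
    """Join per-page strings and record each page's character offsets."""
    parts: list[str] = []
    page_map: list[PageInfo] = []
    cursor = 0
    for number, page in enumerate(pages, start=1):
        if parts:
            # Account for the separator inserted before every later page.
            cursor += len(PAGE_SEPARATOR)
        page_map.append(PageInfo(number, cursor, cursor + len(page)))
        parts.append(page)
        cursor += len(page)
    return PAGE_SEPARATOR.join(parts), page_map
```

The returned text and page_map can then be passed to the Document constructor in place of the empty list used in the example above.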