BaseLoader

All loaders implement the BaseLoader abstract base class. You can subclass it to add support for custom file formats.

Python
from abc import ABC, abstractmethod
from cognity_ai.models import Document

class BaseLoader(ABC):

    @abstractmethod
    def load(self, path: str) -> list[Document]: ...

    @property
    @abstractmethod
    def supported_extensions(self) -> list[str]: ...

load(path) abstract method

Load and parse a file at path, returning a list of Document objects. Multi-page formats like PDF return one Document per file (not per page) with a populated page_map.

  • path str Absolute or relative path to the source file.

Returns  list[Document] — typically one element; some loaders (e.g. ExcelLoader) may return one per sheet.

supported_extensions abstract property

Returns the list of file extensions this loader handles, each including the leading dot (e.g. [".pdf"]). Used by LoaderFactory to auto-select the correct loader.

Returns  list[str]
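
Because BaseLoader is an abstract base class, Python refuses to instantiate a subclass that leaves either abstract member unimplemented. The sketch below illustrates this with stand-ins: the trimmed Document and the local BaseLoader only mirror the interface shown above, and IncompleteLoader and UpperTxtLoader are hypothetical names that are not part of cognity-ai.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for cognity_ai.models.Document, trimmed to two fields."""
    text: str
    metadata: dict = field(default_factory=dict)


class BaseLoader(ABC):
    """Stand-in that mirrors the interface shown above."""

    @abstractmethod
    def load(self, path: str) -> list[Document]: ...

    @property
    @abstractmethod
    def supported_extensions(self) -> list[str]: ...


class IncompleteLoader(BaseLoader):
    # supported_extensions is missing, so this class is still abstract.
    def load(self, path: str) -> list[Document]:
        return []


class UpperTxtLoader(BaseLoader):
    """Hypothetical toy loader: reads a text file and upper-cases it."""

    @property
    def supported_extensions(self) -> list[str]:
        return [".txt"]

    def load(self, path: str) -> list[Document]:
        with open(path, encoding="utf-8") as fh:
            return [Document(text=fh.read().upper(), metadata={"source": path})]


# Instantiating the incomplete subclass raises TypeError.
try:
    IncompleteLoader()
    instantiated = True
except TypeError:
    instantiated = False
```

A subclass that implements both members, such as UpperTxtLoader here, instantiates normally.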


Document Model

The universal document representation returned by all loaders. Downstream pipeline stages (chunkers, embedders, extractors) operate on this type.

Python
from dataclasses import dataclass

from cognity_ai.models import Document, PageInfo, ImageRef

@dataclass
class Document:
    id:          str             # UUID — auto-generated on load
    text:        str             # Full extracted plain text
    metadata:    dict            # source, title, page_count, author, …
    source_path: str             # Absolute path to the original file
    page_map:    list[PageInfo]  # Per-page character offsets + headings
    image_refs:  list[ImageRef]  # Embedded image positions in the text

@dataclass
class PageInfo:
    page_number: int
    char_start:  int
    char_end:    int
    headings:    list[str]

@dataclass
class ImageRef:
    image_id:    str    # Unique ID within the document
    char_offset: int    # Position in text where the image is anchored
    page_number: int
    caption:     str    # OCR-extracted or explicit caption

Field        Type            Description
id           str             UUID generated at load time. Stable for the same file path.
text         str             Full extracted plain text, normalised to UTF-8 with page boundaries marked by \n\n.
metadata     dict            Loader-specific metadata. Common keys: source, title, author, page_count, created, encoding.
source_path  str             Absolute path to the original file on disk.
page_map     list[PageInfo]  One entry per page. Maps page numbers to character offsets in text.
image_refs   list[ImageRef]  References to embedded images with their text anchor positions and captions.
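
Since page_map maps page numbers to character offsets in text, it can be used to recover the text of a single page or to find the page that contains a given offset (for example an ImageRef.char_offset). The sketch below is a minimal illustration under stated assumptions: PageInfo is a local stand-in with the same fields, page_text and page_for_offset are hypothetical helper names, and char_end is assumed to be exclusive, like a Python slice bound.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageInfo:
    """Stand-in with the same fields as cognity_ai.models.PageInfo."""
    page_number: int
    char_start: int
    char_end: int
    headings: list[str] = field(default_factory=list)


def page_text(text: str, page_map: list[PageInfo], page_number: int) -> str:
    """Return the slice of text that belongs to one page."""
    for info in page_map:
        if info.page_number == page_number:
            # Assumption: char_end is exclusive, like a Python slice bound.
            return text[info.char_start:info.char_end]
    raise KeyError(f"page {page_number} not in page_map")


def page_for_offset(page_map: list[PageInfo], char_offset: int) -> Optional[int]:
    """Return the page number whose range contains char_offset, if any."""
    for info in page_map:
        if info.char_start <= char_offset < info.char_end:
            return info.page_number
    return None


# Two pages separated by the "\n\n" page-boundary marker.
text = "First page.\n\nSecond page."
page_map = [
    PageInfo(page_number=1, char_start=0, char_end=11),
    PageInfo(page_number=2, char_start=13, char_end=25),
]

second = page_text(text, page_map, 2)       # "Second page."
anchor_page = page_for_offset(page_map, 14)  # 2
```

Offsets that fall on the boundary marker itself belong to no page in this sketch, so page_for_offset returns None for them.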

Supported Formats

cognity-ai ships loaders for 11 file format families. All are auto-selected by LoaderFactory based on file extension.

Loader Class   Extensions                             Key Dependency      Metadata Extracted
TxtLoader      .txt                                   stdlib              filename, encoding, line count
MdLoader       .md .markdown                          stdlib              H1–H6 headings as sections
PdfLoader      .pdf                                   pdfplumber, pypdf   page numbers, headings, tables, embedded images
DocxLoader     .docx .doc                             python-docx         headings, tables, page breaks, embedded images
ExcelLoader    .xlsx .xls                             openpyxl, pandas    sheet names, row-to-text conversion, formula values
CsvLoader      .csv .tsv                              pandas              headers, delimiter auto-detect
PptxLoader     .pptx .ppt                             python-pptx         slide numbers, speaker notes, embedded images
HtmlLoader     .html .htm                             beautifulsoup4      title, headings, links
JsonLoader     .json                                  stdlib              key structure, depth
YamlLoader     .yaml .yml                             PyYAML              key structure
ImageLoader    .jpg .jpeg .png .bmp .tiff .webp .gif  OCR subsystem       filename, OCR confidence score
Note: ImageLoader delegates to the OCR backend configured in LibraryConfig.ocr. Set ocr to "gemini_vision", "aws_textract", "google_vision", or "tesseract", depending on your deployment. For scanned PDFs, PdfLoader automatically invokes the same OCR backend when native text extraction yields empty pages.
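
The OCR fallback described in the note can be pictured as a per-page check: keep the natively extracted text where there is any, and call the OCR backend only for pages that came back empty. This is a rough sketch of that idea, not PdfLoader's actual code; fill_empty_pages and the ocr_page callback are hypothetical names introduced for illustration.

```python
from typing import Callable


def fill_empty_pages(
    page_texts: list[str],
    ocr_page: Callable[[int], str],
) -> list[str]:
    """Replace blank pages with OCR output; keep natively extracted text as is.

    ocr_page is a hypothetical callback that receives a 1-indexed page
    number and returns the recognised text for that page.
    """
    result = []
    for page_number, text in enumerate(page_texts, start=1):
        if text.strip():
            # Native extraction produced text, so no OCR call is needed.
            result.append(text)
        else:
            result.append(ocr_page(page_number))
    return result
```

Pages with native text never reach the callback, so a mostly digital PDF with a few scanned pages only pays the OCR cost for those pages.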

LoaderFactory

The recommended way to get a loader. LoaderFactory maintains an extension-to-loader registry and auto-selects the correct loader for any given file path.

Python
from cognity_ai.loaders import LoaderFactory

factory = LoaderFactory()

# Auto-detect from extension — returns the appropriate BaseLoader subclass
loader = factory.get_loader("report.pdf")
docs = loader.load("report.pdf")

# Load a directory — recurses and dispatches per-file
all_docs = factory.load_directory("./documents/")

# Register a custom loader for a proprietary format
# MyCustomLoader stands for your own BaseLoader subclass
factory.register(".myext", MyCustomLoader)

get_loader(path) method

Look up and return the loader registered for the extension of path.

  • path str File path or filename. Only the extension is used for lookup.

Returns  BaseLoader

Raises  UnsupportedFormatError if no loader is registered for the extension.
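
Conceptually, the lookup is a dictionary keyed by file extension. The sketch below shows one way such a registry could behave. It is an illustration rather than LoaderFactory's implementation: ExtensionRegistry is a hypothetical class, UnsupportedFormatError is a local stand-in for the library's exception, and both the lower-casing of extensions and the no-argument instantiation of the loader class are assumptions.

```python
import os


class UnsupportedFormatError(Exception):
    """Stand-in for the library's exception of the same name."""


class ExtensionRegistry:
    """Hypothetical extension-to-loader registry."""

    def __init__(self) -> None:
        self._loaders: dict[str, type] = {}

    def register(self, extension: str, loader_class: type) -> None:
        # A later registration replaces an earlier one for the same extension.
        # Assumption: extensions are compared case-insensitively.
        self._loaders[extension.lower()] = loader_class

    def get_loader(self, path: str):
        # Only the extension matters; the file does not need to exist.
        extension = os.path.splitext(path)[1].lower()
        loader_class = self._loaders.get(extension)
        if loader_class is None:
            raise UnsupportedFormatError(
                f"no loader registered for {extension!r}"
            )
        # Assumption: loader classes can be constructed without arguments.
        return loader_class()
```

Callers that process mixed input can catch UnsupportedFormatError around get_loader and skip or log the file.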

load_directory(path, recursive=True, glob="**/*") method

Walk a directory and load all files whose extensions are registered. Unknown extensions are silently skipped.

  • path str Root directory to scan.
  • recursive bool Whether to descend into subdirectories. Default True.
  • glob str Glob pattern to filter files. Default "**/*".

Returns  list[Document]
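
The file selection step can be approximated with pathlib. The sketch below only shows which files would be picked up, not how they are loaded. iter_loadable_files is a hypothetical helper, and the way recursive and glob interact here (recursive=False keeps only direct children of the root) is an assumption.

```python
from pathlib import Path


def iter_loadable_files(root, extensions, recursive=True, glob="**/*"):
    """Yield files under root whose extension is in extensions, sorted by path."""
    root = Path(root)
    known = {ext.lower() for ext in extensions}
    for candidate in sorted(root.glob(glob)):
        if not candidate.is_file():
            continue
        # Assumption: recursive=False restricts matches to root's direct children.
        if not recursive and candidate.parent != root:
            continue
        # Files with unregistered extensions are skipped without an error.
        if candidate.suffix.lower() in known:
            yield candidate
```

For example, list(iter_loadable_files("./documents/", [".pdf", ".txt"])) would return every PDF and text file below ./documents/ and ignore everything else.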

register(extension, loader_class) method

Register a custom loader class for a file extension. Overrides any existing registration for that extension.

  • extension str File extension including the leading dot, e.g. ".xyz".
  • loader_class type[BaseLoader] The loader class (not an instance) to register.

Direct Loader Usage

You can also instantiate loaders directly, which is useful when you need access to loader-specific options:

Python
from cognity_ai.loaders import PdfLoader, DocxLoader, CsvLoader

# PDF — extract text and images
pdf_loader = PdfLoader(extract_images=True, ocr_fallback=True)
docs = pdf_loader.load("annual_report.pdf")
doc = docs[0]
print(doc.metadata["page_count"])   # → 42
print(len(doc.page_map))            # → 42
print(len(doc.image_refs))          # → number of embedded images

# DOCX — headings and tables
docx_loader = DocxLoader()
docs = docx_loader.load("spec.docx")

# CSV — custom delimiter
csv_loader = CsvLoader(delimiter=";")
docs = csv_loader.load("data.csv")
print(docs[0].metadata["headers"])   # → ["col1", "col2", …]

PDF Utilities

Lower-level helpers in cognity_ai.loaders.pdf_utils for programmatic PDF manipulation independent of the loader pipeline.

Python
from cognity_ai.loaders.pdf_utils import (
    extract_tables,   # → list[pd.DataFrame], one per extracted table
    extract_images,   # → list[bytes], raw bytes of each embedded image
    extract_metadata, # → dict  (author, title, created, page_count, …)
    slice_pages,      # → bytes  (sub-PDF from page range)
    merge_pdfs,       # → bytes  (merged single PDF)
    pdf_to_images,    # → list[PIL.Image]  (full-page raster for OCR)
)

Function          Signature            Returns          Description
extract_tables    (path, pages=None)   list[DataFrame]  Extract all tables from the PDF as pandas DataFrames. Pass pages=[1,3] to limit to specific pages.
extract_images    (path, pages=None)   list[bytes]      Return raw image bytes for every embedded image. Includes JPEG, PNG, and JBIG2 streams.
extract_metadata  (path)               dict             Return document-level metadata: author, title, subject, creator, created, page_count.
slice_pages       (path, start, end)   bytes            Extract pages start through end (1-indexed, inclusive) as a new in-memory PDF. Write to disk with open(..., "wb").write(result).
merge_pdfs        (paths)              bytes            Concatenate multiple PDFs in order. Accepts a list of file paths.
pdf_to_images     (path, dpi=150)      list[PIL.Image]  Render each page as a PIL Image at the specified DPI. Useful for full-page OCR pipelines.

PDF Utilities Example

Python
from cognity_ai.loaders.pdf_utils import extract_tables, slice_pages, pdf_to_images

# Extract all tables from a financial report
tables = extract_tables("report.pdf")
for df in tables:
    print(df.to_string())

# Slice out the appendix (pages 40–55) as a new PDF
appendix_bytes = slice_pages("report.pdf", start=40, end=55)
with open("appendix.pdf", "wb") as f:
    f.write(appendix_bytes)

# Render pages as images for vision model processing
images = pdf_to_images("scanned_doc.pdf", dpi=300)
for i, img in enumerate(images):
    img.save(f"page_{i+1}.png")
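
slice_pages takes a 1-indexed, inclusive page range, whereas Python sequences are 0-indexed with an exclusive end. If you need to apply the same range to a Python list, such as the result of pdf_to_images, the conversion looks roughly like this. page_range_to_slice is a hypothetical helper, and raising ValueError for an out-of-range request is an assumption, not documented behaviour of slice_pages.

```python
def page_range_to_slice(start: int, end: int, page_count: int) -> slice:
    """Convert a 1-indexed, inclusive page range into a 0-indexed Python slice."""
    if not 1 <= start <= end <= page_count:
        raise ValueError(
            f"invalid page range {start}-{end} for a {page_count}-page document"
        )
    # The start shifts down by one; the inclusive end N equals the exclusive index N.
    return slice(start - 1, end)


# Pages 40 to 55 of a 60-page document: 16 pages in total.
page_numbers = list(range(1, 61))
appendix = page_numbers[page_range_to_slice(40, 55, 60)]
```

Here appendix holds the page numbers 40 through 55, matching the range passed to slice_pages in the example above.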

Writing a Custom Loader

Subclass BaseLoader, implement the two abstract members, and register with LoaderFactory:

Python
import os
import uuid

from cognity_ai.loaders import BaseLoader, LoaderFactory
from cognity_ai.models import Document

class EpubLoader(BaseLoader):

    @property
    def supported_extensions(self) -> list[str]:
        return [".epub"]

    def load(self, path: str) -> list[Document]:
        # _parse_epub is a helper you supply; it returns the book's plain text
        text = _parse_epub(path)
        return [
            Document(
                id=str(uuid.uuid4()),
                text=text,
                metadata={"source": path},
                # Document.source_path is documented as an absolute path
                source_path=os.path.abspath(path),
                page_map=[],    # this example extracts no page information
                image_refs=[],  # this example extracts no images
            )
        ]

# Register globally
factory = LoaderFactory()
factory.register(".epub", EpubLoader)