Loaders
Document loaders for every major file format, plus the BaseLoader ABC for custom implementations.
BaseLoader
All loaders implement the BaseLoader abstract base class. You can subclass it to add support for custom file formats.
from abc import ABC, abstractmethod
from cognity_ai.models import Document

class BaseLoader(ABC):
    @abstractmethod
    def load(self, path: str) -> list[Document]: ...

    @property
    @abstractmethod
    def supported_extensions(self) -> list[str]: ...
load(path)
Load and parse the file at path and return a list of Document objects. Multi-page formats such as PDF return one Document per file (not per page), with a populated page_map.
- path str Absolute or relative path to the source file.
Returns list[Document] — typically one element; some loaders (e.g. ExcelLoader) may return one per sheet.
supported_extensions
Returns the list of file extensions this loader handles, each including the leading dot (e.g. [".pdf"]). LoaderFactory uses this list to auto-select the correct loader.
Returns list[str]
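Because both members are abstract, Python refuses to instantiate a subclass that leaves either one out. The sketch below illustrates this with a local copy of the interface and a reduced stand-in for Document, so that it runs without cognity-ai installed. IncompleteLoader, CompleteLoader, and can_instantiate are hypothetical names used only for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for cognity_ai.models.Document, reduced to two fields."""
    text: str
    metadata: dict = field(default_factory=dict)


class BaseLoader(ABC):
    """Local copy of the interface shown above, so the sketch runs on its own."""

    @abstractmethod
    def load(self, path: str) -> list[Document]: ...

    @property
    @abstractmethod
    def supported_extensions(self) -> list[str]: ...


class IncompleteLoader(BaseLoader):
    # Implements load() but forgets supported_extensions.
    def load(self, path: str) -> list[Document]:
        return [Document(text="")]


class CompleteLoader(BaseLoader):
    @property
    def supported_extensions(self) -> list[str]:
        return [".log"]

    def load(self, path: str) -> list[Document]:
        return [Document(text="", metadata={"source": path})]


def can_instantiate(loader_class: type) -> bool:
    """Return True if the class can be constructed with no arguments."""
    try:
        loader_class()
    except TypeError:
        # Raised by ABC machinery when abstract members are still missing.
        return False
    return True
```

Here can_instantiate(CompleteLoader) succeeds, while can_instantiate(IncompleteLoader) fails at construction time rather than later, when the factory first reads supported_extensions.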
Document Model
The universal document representation returned by all loaders. Downstream pipeline stages (chunkers, embedders, extractors) operate on this type.
from dataclasses import dataclass
from cognity_ai.models import Document, PageInfo, ImageRef

@dataclass
class Document:
    id: str                      # UUID, auto-generated on load
    text: str                    # Full extracted plain text
    metadata: dict               # source, title, page_count, author, …
    source_path: str             # Absolute path to the original file
    page_map: list[PageInfo]     # Per-page character offsets + headings
    image_refs: list[ImageRef]   # Embedded image positions in the text

@dataclass
class PageInfo:
    page_number: int
    char_start: int
    char_end: int
    headings: list[str]

@dataclass
class ImageRef:
    image_id: str      # Unique ID within the document
    char_offset: int   # Position in text where the image is anchored
    page_number: int
    caption: str       # OCR-extracted or explicit caption
| Field | Type | Description |
|---|---|---|
| id | str | UUID generated at load time. Stable for the same file path. |
| text | str | Full extracted plain text, normalised to UTF-8 with page boundaries marked by \n\n. |
| metadata | dict | Loader-specific metadata. Common keys: source, title, author, page_count, created, encoding. |
| source_path | str | Absolute path to the original file on disk. |
| page_map | list[PageInfo] | One entry per page. Maps page numbers to character offsets in text. |
| image_refs | list[ImageRef] | References to embedded images with their text anchor positions and captions. |
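As an illustration of how page_map relates to text, the sketch below slices a single page out of the full text and maps a character offset back to its page number. It uses a local stand-in for PageInfo and assumes that char_end is exclusive and that page_map is sorted by char_start; neither detail is stated above, so check both against real loader output. The helper names page_text and page_for_offset are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageInfo:
    """Stand-in mirroring the PageInfo fields documented above."""
    page_number: int
    char_start: int
    char_end: int
    headings: list[str] = field(default_factory=list)


def page_text(text: str, page: PageInfo) -> str:
    """Slice one page out of the full text (assumes char_end is exclusive)."""
    return text[page.char_start:page.char_end]


def page_for_offset(page_map: list[PageInfo], offset: int) -> Optional[int]:
    """Return the page number containing offset, or None if it falls in a gap.

    Assumes page_map is sorted by char_start.
    """
    starts = [p.char_start for p in page_map]
    i = bisect_right(starts, offset) - 1   # last page starting at or before offset
    if i < 0:
        return None
    page = page_map[i]
    return page.page_number if offset < page.char_end else None


# Two pages joined by the "\n\n" page-boundary marker.
text = "Intro text\n\nSecond page"
page_map = [
    PageInfo(page_number=1, char_start=0, char_end=10),
    PageInfo(page_number=2, char_start=12, char_end=23),
]
```

The same lookup could be applied to ImageRef.char_offset to cross-check an image's page_number against the page map.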
Supported Formats
cognity-ai ships loaders for 11 file format families. All are auto-selected by LoaderFactory based on file extension.
| Loader Class | Extensions | Key Dependency | Metadata Extracted |
|---|---|---|---|
| TxtLoader | .txt | stdlib | filename, encoding, line count |
| MdLoader | .md .markdown | stdlib | H1–H6 headings as sections |
| PdfLoader | .pdf | pdfplumber, pypdf | page numbers, headings, tables, embedded images |
| DocxLoader | .docx .doc | python-docx | headings, tables, page breaks, embedded images |
| ExcelLoader | .xlsx .xls | openpyxl, pandas | sheet names, row-to-text conversion, formula values |
| CsvLoader | .csv .tsv | pandas | headers, delimiter auto-detect |
| PptxLoader | .pptx .ppt | python-pptx | slide numbers, speaker notes, embedded images |
| HtmlLoader | .html .htm | beautifulsoup4 | title, headings, links |
| JsonLoader | .json | stdlib | key structure, depth |
| YamlLoader | .yaml .yml | PyYAML | key structure |
| ImageLoader | .jpg .jpeg .png .bmp .tiff .webp .gif | OCR subsystem | filename, OCR confidence score |
ImageLoader delegates to the OCR backend configured in LibraryConfig.ocr. Set ocr to "gemini_vision", "aws_textract", "google_vision", or "tesseract", depending on your deployment. For scanned PDFs, PdfLoader automatically invokes the same OCR backend when native text extraction yields empty pages.
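The OCR fallback for scanned PDFs can be pictured as a per-page check: keep the natively extracted text where there is any, and call the OCR backend only for pages that came back blank. The following is a minimal sketch of that pattern, not PdfLoader's actual code. fill_empty_pages and fake_ocr are hypothetical names, and the OCR backend is reduced to a plain callable.

```python
from typing import Callable


def fill_empty_pages(
    native_pages: list[str],
    ocr_page: Callable[[int], str],
) -> list[str]:
    """Return per-page text, calling ocr_page(index) only for blank pages."""
    result = []
    for index, text in enumerate(native_pages):
        if text.strip():
            result.append(text)             # native extraction produced text
        else:
            result.append(ocr_page(index))  # blank page: fall back to OCR
    return result


# Hypothetical OCR backend: a real one would render the page and run OCR on it.
def fake_ocr(index: int) -> str:
    return f"[ocr text for page {index + 1}]"


# Page 1 has native text; pages 2 and 3 are empty or whitespace-only.
pages = fill_empty_pages(["Title page", "", "   \n"], fake_ocr)
```

Treating whitespace-only pages as empty is an assumption of this sketch; the source only says that OCR runs when extraction "yields empty pages".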
LoaderFactory
The recommended way to get a loader. LoaderFactory maintains an extension-to-loader registry and auto-selects the correct loader for any given file path.
from cognity_ai.loaders import LoaderFactory
factory = LoaderFactory()
# Auto-detect from extension — returns the appropriate BaseLoader subclass
loader = factory.get_loader("report.pdf")
docs = loader.load("report.pdf")
# Load a directory — recurses and dispatches per-file
all_docs = factory.load_directory("./documents/")
# Register a custom loader for a proprietary format
factory.register(".myext", MyCustomLoader)
get_loader(path)
Look up and return the loader registered for the extension of path.
- path str File path or filename. Only the extension is used for lookup.
Returns BaseLoader. Raises UnsupportedFormatError if no loader is registered for the extension.
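To make the lookup concrete, here is a toy extension-to-loader registry. It is a sketch of the general technique, not LoaderFactory's implementation: MiniFactory, PdfStub, and TxtStub are hypothetical, UnsupportedFormatError is redefined locally, and the case-insensitive matching is an assumption that the text above does not confirm.

```python
from pathlib import Path


class UnsupportedFormatError(Exception):
    """Stand-in for the library's exception of the same name."""


class MiniFactory:
    """Toy extension-to-loader registry; not the library's implementation."""

    def __init__(self) -> None:
        self._registry: dict[str, type] = {}

    def register(self, extension: str, loader_class: type) -> None:
        # A later registration replaces an earlier one for the same extension.
        self._registry[extension.lower()] = loader_class

    def get_loader(self, path: str):
        # Only the final suffix matters: "dir/Report.PDF" -> ".pdf".
        extension = Path(path).suffix.lower()
        try:
            loader_class = self._registry[extension]
        except KeyError:
            raise UnsupportedFormatError(f"no loader for {extension!r}") from None
        return loader_class()   # hand back an instance, ready for .load()


class PdfStub: ...
class TxtStub: ...


factory = MiniFactory()
factory.register(".pdf", PdfStub)
factory.register(".txt", TxtStub)
```

With this registry, factory.get_loader("dir/Report.PDF") yields a PdfStub instance, and an unregistered extension such as ".epub" raises UnsupportedFormatError.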
load_directory(path, recursive=True, glob="**/*")
Walk a directory and load all files whose extensions are registered. Unknown extensions are silently skipped.
- path str Root directory to scan.
- recursive bool Whether to descend into subdirectories. Default True.
- glob str Glob pattern to filter files. Default "**/*".
Returns list[Document]
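The file-selection step of a directory load can be approximated with pathlib alone. The sketch below collects the files under a root that match a glob pattern and carry a registered extension, skipping everything else without raising. find_loadable_files is a hypothetical helper, and the recursive flag is left out because the text above does not say how it combines with glob.

```python
import tempfile
from pathlib import Path


def find_loadable_files(root: str, registered: set[str], glob: str = "**/*") -> list[Path]:
    """Return files under root that match glob and have a registered suffix."""
    matches = []
    for candidate in sorted(Path(root).glob(glob)):
        if candidate.is_file() and candidate.suffix.lower() in registered:
            matches.append(candidate)
        # Directories and unknown extensions are skipped silently.
    return matches


# Build a throwaway tree to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("hello")
    (root / "sub" / "b.pdf").write_bytes(b"%PDF-")
    (root / "sub" / "c.xyz").write_text("ignored")
    found = [
        p.relative_to(root).as_posix()
        for p in find_loadable_files(tmp, {".txt", ".pdf"})
    ]
```

In this sketch each returned path would then be passed to the loader chosen for its extension, and the resulting Document lists concatenated.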
register(extension, loader_class)
Register a custom loader class for a file extension. Overrides any existing registration for that extension.
- extension str File extension including the leading dot, e.g. ".xyz".
- loader_class type[BaseLoader] The loader class (not an instance) to register.
Direct Loader Usage
You can also instantiate loaders directly, which is useful when you need access to loader-specific options:
from cognity_ai.loaders import PdfLoader, DocxLoader, CsvLoader
# PDF — extract text and images
pdf_loader = PdfLoader(extract_images=True, ocr_fallback=True)
docs = pdf_loader.load("annual_report.pdf")
doc = docs[0]
print(doc.metadata["page_count"]) # → 42
print(len(doc.page_map)) # → 42
print(len(doc.image_refs)) # → number of embedded images
# DOCX — headings and tables
docx_loader = DocxLoader()
docs = docx_loader.load("spec.docx")
# CSV — custom delimiter
csv_loader = CsvLoader(delimiter=";")
docs = csv_loader.load("data.csv")
print(docs[0].metadata["headers"]) # → ["col1", "col2", …]
PDF Utilities
Lower-level helpers in cognity_ai.loaders.pdf_utils for programmatic PDF manipulation independent of the loader pipeline.
from cognity_ai.loaders.pdf_utils import (
    extract_tables,    # → list[pd.DataFrame], one per page
    extract_images,    # → list[bytes], raw image bytes per page
    extract_metadata,  # → dict (author, title, created, page_count, …)
    slice_pages,       # → bytes (sub-PDF from page range)
    merge_pdfs,        # → bytes (merged single PDF)
    pdf_to_images,     # → list[PIL.Image] (full-page raster for OCR)
)
| Function | Signature | Returns | Description |
|---|---|---|---|
| extract_tables | (path, pages=None) | list[DataFrame] | Extract all tables from the PDF as pandas DataFrames. Pass pages=[1,3] to limit to specific pages. |
| extract_images | (path, pages=None) | list[bytes] | Return raw image bytes for every embedded image. Includes JPEG, PNG, and JBIG2 streams. |
| extract_metadata | (path) | dict | Return document-level metadata: author, title, subject, creator, created, page_count. |
| slice_pages | (path, start, end) | bytes | Extract pages start–end (1-indexed, inclusive) as a new in-memory PDF. Write to disk with open(..., "wb").write(result). |
| merge_pdfs | (paths) | bytes | Concatenate multiple PDFs in order. Accepts a list of file paths. |
| pdf_to_images | (path, dpi=150) | list[PIL.Image] | Render each page as a PIL Image at the specified DPI. Useful for full-page OCR pipelines. |
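The 1-indexed, inclusive page range used by slice_pages differs from Python's 0-indexed, end-exclusive slicing, which is an easy source of off-by-one errors when mixing the two. The small sketch below shows the conversion; inclusive_pages_to_slice is a hypothetical helper, not part of pdf_utils.

```python
def inclusive_pages_to_slice(start: int, end: int) -> slice:
    """Map a 1-indexed inclusive page range to a 0-indexed Python slice."""
    if start < 1 or end < start:
        raise ValueError(f"invalid page range {start}-{end}")
    # Shift the start down by one; the inclusive end equals the exclusive stop.
    return slice(start - 1, end)


pages = [f"page {n}" for n in range(1, 61)]   # pretend 60-page document
appendix = pages[inclusive_pages_to_slice(40, 55)]
```

Pages 40 through 55 inclusive give 16 pages, from "page 40" to "page 55".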
PDF Utilities Example
from cognity_ai.loaders.pdf_utils import extract_tables, slice_pages, pdf_to_images

# Extract all tables from a financial report
tables = extract_tables("report.pdf")
for df in tables:
    print(df.to_string())

# Slice out the appendix (pages 40–55) as a new PDF
appendix_bytes = slice_pages("report.pdf", start=40, end=55)
with open("appendix.pdf", "wb") as f:
    f.write(appendix_bytes)

# Render pages as images for vision model processing
images = pdf_to_images("scanned_doc.pdf", dpi=300)
for i, img in enumerate(images):
    img.save(f"page_{i+1}.png")
Writing a Custom Loader
Subclass BaseLoader, implement the two abstract members, and register with LoaderFactory:
from cognity_ai.loaders import BaseLoader, LoaderFactory
from cognity_ai.models import Document
import uuid

class EpubLoader(BaseLoader):
    @property
    def supported_extensions(self) -> list[str]:
        return [".epub"]

    def load(self, path: str) -> list[Document]:
        # parse the epub, extract text …
        # (_parse_epub is a placeholder for your own parsing code)
        text = _parse_epub(path)
        return [
            Document(
                id=str(uuid.uuid4()),
                text=text,
                metadata={"source": path},
                source_path=path,
                page_map=[],
                image_refs=[],
            )
        ]

# Register globally
factory = LoaderFactory()
factory.register(".epub", EpubLoader)
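Note that uuid.uuid4() produces a different id on every load, whereas the Document field table describes id as stable for the same file path. How the built-in loaders derive their ids is not stated here. If a custom loader needs repeatable ids, one possible approach is a name-based UUID computed from the resolved path, as in this sketch. stable_document_id is a hypothetical helper, and the choice of uuid5 with NAMESPACE_URL is an assumption, not the library's documented scheme.

```python
import uuid
from pathlib import Path


def stable_document_id(path: str) -> str:
    """Derive a deterministic UUID from the file's absolute path."""
    # resolve() makes the path absolute so "report.pdf" and "./report.pdf" agree;
    # as_uri() turns it into a file:// URL to pair with NAMESPACE_URL.
    file_uri = Path(path).resolve().as_uri()
    return str(uuid.uuid5(uuid.NAMESPACE_URL, file_uri))
```

An id built this way changes if the file is moved or renamed, and it does not change when the file's contents do, so it suits path-keyed caches better than content deduplication.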