BaseOCR ABC

All OCR providers implement BaseOCR. LLM-based providers override the supports_multimodal property to return True, signalling richer understanding of charts, tables, and handwriting.

from abc import ABC, abstractmethod
from pathlib import Path

class BaseOCR(ABC):
    @abstractmethod
    def ocr(self, image: str | bytes | Path) -> str:
        """
        Extract text from an image.

        Args:
            image: File path (str or Path) or raw image bytes.

        Returns:
            Extracted text as a string.
        """
        ...

    @property
    def supports_multimodal(self) -> bool:
        """True for LLM-based providers that understand layout, tables, charts."""
        return False
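
As a rough illustration of what a custom provider could look like, the sketch below re-declares the interface locally (so it runs without the library installed) and adds a toy subclass. EchoOCR and load_image_bytes are hypothetical names that are not part of the library; the toy provider simply decodes the bytes as UTF-8 instead of doing real recognition.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Union

ImageInput = Union[str, bytes, Path]


class BaseOCR(ABC):
    """Local stand-in that mirrors the interface shown above."""

    @abstractmethod
    def ocr(self, image: ImageInput) -> str:
        ...

    @property
    def supports_multimodal(self) -> bool:
        return False


def load_image_bytes(image: ImageInput) -> bytes:
    """Hypothetical helper: normalize the three accepted input types to bytes."""
    if isinstance(image, (bytes, bytearray)):
        return bytes(image)
    # str and Path are both treated as file paths.
    return Path(image).read_bytes()


class EchoOCR(BaseOCR):
    """Toy provider: 'recognizes' text by decoding the raw bytes as UTF-8."""

    def ocr(self, image: ImageInput) -> str:
        return load_image_bytes(image).decode("utf-8", errors="replace")
```

Because supports_multimodal is a property with a default of False, a non-LLM provider such as this one only needs to implement ocr(); an LLM-based provider would additionally override the property.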

Fallback Chain

The OCRFactory selects the best available provider at init time. If a provider's API key is missing or the provider is otherwise unavailable, the factory automatically falls back to the next one in the chain:

gemini_vision      Gemini 2.0 Flash multimodal (default)
    ↓ if unavailable
openai_vision      GPT-4o vision API
    ↓ if unavailable
anthropic_vision   Claude 3.5 Sonnet vision API
    ↓ if unavailable
tesseract          Local pytesseract (local; always available if installed)

Note: You can pin a specific provider with ocr="tesseract" (or any other provider key) in the RAGLibrary constructor to skip auto-detection entirely.
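
The selection order above can be sketched as a small pure function. This is not the OCRFactory implementation, only an illustration of the described behaviour. select_provider, the shape of the api_keys mapping, and the RuntimeError raised when nothing is available are all assumptions. The Tesseract check uses shutil.which to look for the binary on PATH.

```python
import shutil
from typing import Mapping, Optional

# Order taken from the fallback chain described above.
FALLBACK_ORDER = ("gemini_vision", "openai_vision", "anthropic_vision")


def select_provider(
    api_keys: Mapping[str, Optional[str]],
    tesseract_available: Optional[bool] = None,
    pinned: Optional[str] = None,
) -> str:
    """Return the provider key that a fallback chain like the one above would pick."""
    # A pinned provider skips auto-detection entirely.
    if pinned is not None:
        return pinned
    # Walk the API-based providers in order; the first one with a key wins.
    for name in FALLBACK_ORDER:
        if api_keys.get(name):
            return name
    # Last resort: local Tesseract, if the binary can be found on PATH.
    if tesseract_available is None:
        tesseract_available = shutil.which("tesseract") is not None
    if tesseract_available:
        return "tesseract"
    # Assumption: the source does not say what happens when nothing is available.
    raise RuntimeError("no OCR provider available")
```

For example, select_provider({"gemini_vision": None, "openai_vision": "sk-..."}) skips Gemini because its key is missing and returns "openai_vision".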

Providers

gemini_vision: GeminiVisionOCR (default; install cognity-ai[gemini])

Uses Gemini 2.0 Flash's multimodal capability to read images. Excellent at complex layouts, multi-column text, tables, charts, handwriting, and mixed-language content. A free tier is available with a Google AI Studio key.

openai_vision: OpenAIVisionOCR (API; install cognity-ai[openai])

Uses GPT-4o vision. High accuracy on structured documents, forms, and printed text. Also handles diagrams and technical figures well.

anthropic_vision: AnthropicVisionOCR (API; install cognity-ai[anthropic])

Uses Claude 3.5 Sonnet vision. Strong at document understanding with contextual awareness — particularly good at preserving semantic structure (headings, lists, captions).

azure_vision: AzureVisionOCR (API; install cognity-ai[azure])

Uses the Azure OpenAI GPT-4o vision endpoint. Model quality is identical to openai_vision, but requests are routed through your Azure subscription, which makes this the preferred option for enterprise Azure environments with data residency requirements.

bedrock_vision: BedrockVisionOCR (API; install cognity-ai[bedrock])

Uses Claude via the AWS Bedrock multimodal API. Authentication goes through IAM credentials, so no explicit API key is needed. Best for AWS-native deployments where all traffic must stay within AWS.

tesseract: TesseractOCR (local; install cognity-ai[ocr-local])

Local OCR using pytesseract (a wrapper for the Tesseract engine). No API is required, so it runs fully offline. Accuracy is best on clean, printed text; it struggles with complex layouts, handwriting, and low-resolution scans. Requires the Tesseract system binary.

Provider Table

Key | Class | Install | Method | Offline? | Best For
gemini_vision | GeminiVisionOCR | cognity-ai[gemini] | Gemini 2.0 Flash multimodal | No | Complex layouts, tables, handwriting
openai_vision | OpenAIVisionOCR | cognity-ai[openai] | GPT-4o vision | No | High accuracy, structured forms
anthropic_vision | AnthropicVisionOCR | cognity-ai[anthropic] | Claude 3.5 Sonnet vision | No | Document understanding, context-aware
azure_vision | AzureVisionOCR | cognity-ai[azure] | Azure GPT-4o vision | No | Enterprise Azure, data residency
bedrock_vision | BedrockVisionOCR | cognity-ai[bedrock] | AWS Bedrock Claude multimodal | No | AWS-native, IAM-based auth
tesseract | TesseractOCR | cognity-ai[ocr-local] | pytesseract (local) | Yes | No API, fully offline, printed text

Image Pipeline

OCR is triggered in two scenarios:

1. Direct image ingestion

When you call rag.ingest("photo.jpg"), the ImageLoader routes the file through the configured OCR provider. The extracted text becomes the document's content and flows into the normal chunking + embedding pipeline.

# Ingest images directly
rag.ingest("scanned_invoice.png")
rag.ingest("whiteboard_photo.jpg")
rag.ingest("receipt.webp")

# Batch ingest a folder with mixed formats
rag.ingest_dir("./uploads/")  # handles .jpg, .png, .pdf, .docx, etc.
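
To make the routing idea concrete, the sketch below partitions a folder into image files (which would go through OCR) and everything else, using the image suffixes this section lists as supported. split_by_kind is a hypothetical helper, not a library function, and the recursive walk is an assumption: the source does not say whether ingest_dir descends into subfolders.

```python
from pathlib import Path
from typing import List, Tuple

# Image suffixes this section lists as supported.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".webp", ".gif"}


def split_by_kind(folder: Path) -> Tuple[List[Path], List[Path]]:
    """Split the files under `folder` into (images, other documents)."""
    images: List[Path] = []
    others: List[Path] = []
    for path in sorted(folder.rglob("*")):
        if not path.is_file():
            continue
        # Compare suffixes case-insensitively so "SCAN.JPG" counts as an image.
        if path.suffix.lower() in IMAGE_SUFFIXES:
            images.append(path)
        else:
            others.append(path)
    return images, others
```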

2. Embedded images in documents

When PDF, DOCX, or PPTX loaders extract embedded images, the OCR subsystem processes each one and injects the resulting text back into the document at the image's position:

# The pipeline handles this automatically
rag.ingest("report_with_charts.pdf")
# ↳ PdfLoader extracts text + embedded image bytes
# ↳ Each image → OCRFactory → extracted text
# ↳ OCR text injected at image position in document
# ↳ Full text (original + OCR) → chunker → embedder → stores
Supported image formats: .jpg, .jpeg, .png, .bmp, .tiff, .webp, .gif
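
The embedded-image step can be pictured as follows. This is a minimal sketch under stated assumptions, not the loaders' actual data model: TextBlock, ImageBlock, and inline_ocr_text are hypothetical names, and the OCR provider is represented by any callable that maps image bytes to text.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class TextBlock:
    text: str


@dataclass
class ImageBlock:
    data: bytes


Block = Union[TextBlock, ImageBlock]


def inline_ocr_text(blocks: List[Block], ocr: Callable[[bytes], str]) -> str:
    """Replace each image block with its OCR text, preserving document order."""
    parts: List[str] = []
    for block in blocks:
        if isinstance(block, ImageBlock):
            text = ocr(block.data).strip()
        else:
            text = block.text.strip()
        # Skip empty results so a blank image does not leave a gap.
        if text:
            parts.append(text)
    return "\n\n".join(parts)
```

Keeping the OCR text at the image's position means a chunk that discusses a chart also contains the chart's extracted text, instead of that text being appended at the end of the document.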

Configuration

from cognity_ai import RAGLibrary

# Use Gemini Vision (default — needs GEMINI_API_KEY or explicit key)
rag = RAGLibrary(ocr="gemini_vision", gemini_api_key="AIza...")

# Use Claude Vision
rag = RAGLibrary(ocr="anthropic_vision", anthropic_api_key="sk-ant-...")

# Use GPT-4o vision
rag = RAGLibrary(ocr="openai_vision", openai_api_key="sk-...")

# Fully offline — no API calls for OCR
rag = RAGLibrary(ocr="tesseract")
# Requires: pip install cognity-ai[ocr-local]
# Requires Tesseract binary:
#   Ubuntu/Debian: sudo apt install tesseract-ocr
#   macOS:         brew install tesseract
#   Windows:       https://github.com/UB-Mannheim/tesseract/wiki

# AWS Bedrock (uses boto3 credential chain)
rag = RAGLibrary(ocr="bedrock_vision", aws_region="us-east-1")

# Azure OpenAI Vision
from cognity_ai.config import LibraryConfig, AzureOpenAIConfig
config = LibraryConfig(
    ocr="azure_vision",
    azure_openai=AzureOpenAIConfig(
        endpoint="https://myinstance.openai.azure.com/",
        api_key="...",
        deployment_name="gpt-4o",
    ),
)
rag = RAGLibrary(config=config)

Direct Usage

Use OCR providers directly without RAGLibrary:

from cognity_ai.ocr import GeminiVisionOCR, TesseractOCR
from cognity_ai.ocr.factory import OCRFactory
from pathlib import Path

# Direct provider
ocr = GeminiVisionOCR(api_key="AIza...")
text = ocr.ocr("invoice.jpg")           # file path string
text = ocr.ocr(Path("scan.png"))        # Path object
image_bytes = Path("scan.png").read_bytes()
text = ocr.ocr(image_bytes)             # bytes also accepted
print(ocr.supports_multimodal)          # True

# Local Tesseract
ocr = TesseractOCR(lang="eng")
text = ocr.ocr("document.tiff")
print(ocr.supports_multimodal)          # False

# Auto-select best available
factory = OCRFactory(
    gemini_api_key="AIza...",           # will be tried first
    openai_api_key=None,                # skipped — no key
)
ocr = factory.get_ocr()                 # returns GeminiVisionOCR
text = ocr.ocr("photo.jpg")

Tesseract binary required: pip install cognity-ai[ocr-local] installs pytesseract (the Python wrapper), but you must also install the Tesseract system binary separately. Without the binary, TesseractOCR raises a TesseractNotFoundError.
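
One way to surface a missing binary early is a preflight check before constructing the local provider. The sketch below uses only the standard library. find_tesseract and require_tesseract are hypothetical helpers, and a PATH lookup is an assumption about how the binary is discovered, so it can miss an install that is configured some other way.

```python
import shutil
from typing import Callable, Optional

Which = Callable[[str], Optional[str]]


def find_tesseract(which: Which = shutil.which) -> Optional[str]:
    """Return the path of the `tesseract` executable on PATH, or None."""
    return which("tesseract")


def require_tesseract(which: Which = shutil.which) -> str:
    """Fail fast with an actionable message if the binary is missing."""
    path = find_tesseract(which)
    if path is None:
        raise RuntimeError(
            "Tesseract binary not found on PATH; install it with your "
            "system package manager before using the local OCR provider."
        )
    return path
```

The which parameter exists only so the lookup can be replaced in tests; in normal use you would call require_tesseract() with no arguments.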