OCR
Optical Character Recognition providers — extract text from images using multimodal LLMs or local Tesseract.
BaseOCR ABC
All OCR providers implement BaseOCR. LLM-based providers set supports_multimodal = True, enabling richer understanding of charts, tables, and handwriting.
from abc import ABC, abstractmethod
from pathlib import Path

class BaseOCR(ABC):
    @abstractmethod
    def ocr(self, image: str | bytes | Path) -> str:
        """
        Extract text from an image.

        Args:
            image: File path (str or Path) or raw image bytes.

        Returns:
            Extracted text as a string.
        """
        ...

    @property
    def supports_multimodal(self) -> bool:
        """True for LLM-based providers that understand layout, tables, charts."""
        return False
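As an illustration of the interface above, a custom provider only has to implement ocr(). The sketch below redefines BaseOCR locally so that it runs on its own. FixedTextOCR and load_image_bytes are hypothetical names used for illustration, not part of the library.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import Path


class BaseOCR(ABC):
    """Local copy of the interface shown above, so the sketch is self-contained."""

    @abstractmethod
    def ocr(self, image: str | bytes | Path) -> str:
        ...

    @property
    def supports_multimodal(self) -> bool:
        return False


def load_image_bytes(image: str | bytes | Path) -> bytes:
    """Hypothetical helper: normalize the three accepted input types to bytes."""
    if isinstance(image, (bytes, bytearray)):
        return bytes(image)
    return Path(image).read_bytes()


class FixedTextOCR(BaseOCR):
    """Hypothetical test double: reads the image but always returns canned text."""

    def __init__(self, text: str) -> None:
        self._text = text

    def ocr(self, image: str | bytes | Path) -> str:
        load_image_bytes(image)  # fails early if a path cannot be read
        return self._text
```

A stub like this can stand in for a real provider in unit tests, since it inherits the default supports_multimodal value of False.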
Fallback Chain
The OCRFactory selects the best available provider at init time. If the primary provider's API key is missing or the provider is unavailable, it automatically falls back to the next available provider.
Pass ocr="tesseract" (or any other provider key) in the RAGLibrary constructor to skip auto-detection entirely.
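The selection logic can be pictured roughly as follows. This is a minimal sketch and not the library's implementation: the priority order, the use of tesseract as the last resort, and the names ASSUMED_ORDER and pick_provider are all assumptions made for illustration.

```python
from __future__ import annotations

# Assumed priority order for illustration; the library's real order may differ.
ASSUMED_ORDER = ["gemini_vision", "openai_vision", "anthropic_vision"]


def pick_provider(api_keys: dict[str, str | None], forced: str | None = None) -> str:
    """Return the provider key to use.

    api_keys: maps provider key -> API key (None or "" when missing).
    forced:   explicit choice that bypasses auto-detection.
    """
    if forced is not None:
        return forced
    for name in ASSUMED_ORDER:
        if api_keys.get(name):
            return name
    # Assumption: with no cloud credentials, use the local offline engine.
    return "tesseract"
```

For example, pick_provider({"gemini_vision": None, "openai_vision": "sk-test"}) skips Gemini because its key is missing and returns "openai_vision".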
Providers
gemini_vision
Uses Gemini 2.0 Flash's multimodal capability to read images. Excellent at complex layouts, multi-column text, tables, charts, handwriting, and mixed-language content. A free tier is available with a Google AI Studio key.
openai_vision
Uses GPT-4o vision. High accuracy on structured documents, forms, and printed text. Also handles diagrams and technical figures well.
anthropic_vision
Uses Claude 3.5 Sonnet vision. Strong at document understanding with contextual awareness, and particularly good at preserving semantic structure (headings, lists, captions).
azure_vision
Uses the Azure OpenAI GPT-4o vision endpoint. Identical model quality to openai_vision, but routed through your Azure subscription. Preferred for enterprise Azure environments with data residency requirements.
bedrock_vision
Uses Claude via the AWS Bedrock multimodal API. Authenticates with IAM credentials, so no explicit API key is needed. Best for AWS-native deployments where all traffic must stay within AWS.
tesseract
Local OCR using pytesseract (a wrapper for the Tesseract engine). No API required; runs fully offline. Most accurate on clean, printed text. Struggles with complex layouts, handwriting, and low-resolution scans. Requires the Tesseract system binary.
Provider Table
| Key | Class | Install | Method | Offline? | Best For |
|---|---|---|---|---|---|
| gemini_vision | GeminiVisionOCR | cognity-ai[gemini] | Gemini 2.0 Flash multimodal | No | Complex layouts, tables, handwriting |
| openai_vision | OpenAIVisionOCR | cognity-ai[openai] | GPT-4o vision | No | High accuracy, structured forms |
| anthropic_vision | AnthropicVisionOCR | cognity-ai[anthropic] | Claude 3.5 Sonnet vision | No | Document understanding, context-aware |
| azure_vision | AzureVisionOCR | cognity-ai[azure] | Azure GPT-4o vision | No | Enterprise Azure, data residency |
| bedrock_vision | BedrockVisionOCR | cognity-ai[bedrock] | AWS Bedrock Claude multimodal | No | AWS-native, IAM-based auth |
| tesseract | TesseractOCR | cognity-ai[ocr-local] | pytesseract (local) | Yes | No API, fully offline, printed text |
Image Pipeline
OCR is triggered in two scenarios:
1. Direct image ingestion
When you call rag.ingest("photo.jpg"), the ImageLoader routes the file through the configured OCR provider. The extracted text becomes the document's content and flows into the normal chunking + embedding pipeline.
# Ingest images directly
rag.ingest("scanned_invoice.png")
rag.ingest("whiteboard_photo.jpg")
rag.ingest("receipt.webp")
# Batch ingest a folder with mixed formats
rag.ingest_dir("./uploads/") # handles .jpg, .png, .pdf, .docx, etc.
2. Embedded images in documents
When PDF, DOCX, or PPTX loaders extract embedded images, the OCR subsystem processes each one and injects the resulting text back into the document at the image's position:
# The pipeline handles this automatically
rag.ingest("report_with_charts.pdf")
# ↳ PdfLoader extracts text + embedded image bytes
# ↳ Each image → OCRFactory → extracted text
# ↳ OCR text injected at image position in document
# ↳ Full text (original + OCR) → chunker → embedder → stores
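The injection step described above can be sketched as a function over an ordered list of document segments. This is an illustrative model only: the Segment type, the inject_ocr_text name, and the choice to join segments with newlines are assumptions, not the loaders' actual internals.

```python
from __future__ import annotations

from typing import Callable, Union

# str = text extracted by the loader, bytes = an embedded image
Segment = Union[str, bytes]


def inject_ocr_text(segments: list[Segment], ocr: Callable[[bytes], str]) -> str:
    """Replace each image segment with its OCR text, preserving document order."""
    parts: list[str] = []
    for segment in segments:
        if isinstance(segment, bytes):
            parts.append(ocr(segment))  # image: substitute the recognized text
        else:
            parts.append(segment)       # plain text: keep as-is
    # Drop empty pieces (e.g. images with no recognizable text) before joining.
    return "\n".join(part for part in parts if part)
```

Because the OCR text takes the image's place in the sequence, a chart caption and the chart's recognized contents end up next to each other before the text reaches the chunker.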
Supported image formats: .jpg, .jpeg, .png, .bmp, .tiff, .webp, .gif
Configuration
from cognity_ai import RAGLibrary
# Use Gemini Vision (default — needs GEMINI_API_KEY or explicit key)
rag = RAGLibrary(ocr="gemini_vision", gemini_api_key="AIza...")
# Use Claude Vision
rag = RAGLibrary(ocr="anthropic_vision", anthropic_api_key="sk-ant-...")
# Use GPT-4o vision
rag = RAGLibrary(ocr="openai_vision", openai_api_key="sk-...")
# Fully offline — no API calls for OCR
rag = RAGLibrary(ocr="tesseract")
# Requires: pip install cognity-ai[ocr-local]
# Requires Tesseract binary:
# Ubuntu/Debian: sudo apt install tesseract-ocr
# macOS: brew install tesseract
# Windows: https://github.com/UB-Mannheim/tesseract/wiki
# AWS Bedrock (uses boto3 credential chain)
rag = RAGLibrary(ocr="bedrock_vision", aws_region="us-east-1")
# Azure OpenAI Vision
from cognity_ai.config import LibraryConfig, AzureOpenAIConfig
config = LibraryConfig(
    ocr="azure_vision",
    azure_openai=AzureOpenAIConfig(
        endpoint="https://myinstance.openai.azure.com/",
        api_key="...",
        deployment_name="gpt-4o",
    ),
)
rag = RAGLibrary(config=config)
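The Gemini example above accepts either the GEMINI_API_KEY environment variable or an explicit key. One plausible way to resolve the two is sketched below. The precedence (explicit argument first, environment second) and the resolve_api_key helper are assumptions for illustration; the library may resolve keys differently.

```python
from __future__ import annotations

import os


def resolve_api_key(explicit: str | None, env_var: str) -> str | None:
    """Prefer an explicitly passed key; otherwise read the environment variable."""
    if explicit:
        return explicit
    # Treat an empty environment value the same as a missing one.
    return os.environ.get(env_var) or None
```

Keeping keys in the environment avoids hard-coding credentials in source files, while the explicit argument remains useful for tests and multi-tenant setups.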
Direct Usage
Use OCR providers directly without RAGLibrary:
from cognity_ai.ocr import GeminiVisionOCR, TesseractOCR
from cognity_ai.ocr.factory import OCRFactory
from pathlib import Path
# Direct provider
ocr = GeminiVisionOCR(api_key="AIza...")
text = ocr.ocr("invoice.jpg") # file path string
text = ocr.ocr(Path("scan.png")) # Path object
image_bytes = Path("scan.png").read_bytes()
text = ocr.ocr(image_bytes) # raw bytes also accepted
print(ocr.supports_multimodal) # True
# Local Tesseract
ocr = TesseractOCR(lang="eng")
text = ocr.ocr("document.tiff")
print(ocr.supports_multimodal) # False
# Auto-select best available
factory = OCRFactory(
    gemini_api_key="AIza...", # will be tried first
    openai_api_key=None, # skipped: no key
)
ocr = factory.get_ocr() # returns GeminiVisionOCR
text = ocr.ocr("photo.jpg")
pip install cognity-ai[ocr-local] installs pytesseract (the Python wrapper), but you must install the Tesseract system binary separately. Without the binary, TesseractOCR raises a TesseractNotFoundError.
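To catch a missing binary before any OCR call is made, a startup check along these lines may help. It uses only the standard library's shutil.which; the tesseract_binary_available helper is a hypothetical name, and the check assumes the executable is expected on PATH.

```python
import shutil


def tesseract_binary_available(command: str = "tesseract") -> bool:
    """Return True if the Tesseract executable can be found on PATH."""
    return shutil.which(command) is not None


if not tesseract_binary_available():
    print("Tesseract binary not found; install it before using ocr='tesseract'.")
```

A PATH lookup will not see a binary installed in a non-standard location, so treat a negative result as a hint rather than proof that Tesseract is absent.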