Documentation

Getting Started

Install cognity-ai, configure your providers, and run your first RAG query.

Installation

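cognity-ai requires Python 3.11+. Choose an install profile that matches your stack, or install everything at once. The commands that use -e "." are editable installs and must be run from the root of a local cognity-ai source checkout.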

bash

# Gemini + Neo4j + ChromaDB + spaCy + all file loaders
pip install -e ".[default]"

bash

# OpenAI GPT + Qdrant + Neo4j + NLP + PDF + Office
pip install "cognity-ai[openai,qdrant,neo4j,nlp,pdf,office]"

bash

# SentenceTransformers + FAISS + NetworkX — no cloud required
pip install "cognity-ai[sentence-transformers,faiss,networkx,nlp,pdf]"

bash

# All providers, all stores, all formats
pip install -e ".[all]"

ℹ️

NLP model: After installing any profile that includes nlp, download the spaCy English model:

python -m spacy download en_core_web_sm
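To check from Python whether the model is already present, something like the following may help. It is a minimal sketch that relies only on the standard library and on the fact that spaCy models install as regular Python packages; the helper name `has_spacy_model` is hypothetical and not part of cognity-ai.

```python
import importlib.util


def has_spacy_model(name: str = "en_core_web_sm") -> bool:
    """Return True if a top-level package with this name is importable."""
    return importlib.util.find_spec(name) is not None


if not has_spacy_model():
    print("Model missing: run `python -m spacy download en_core_web_sm`")
```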

Configuration

All configuration is passed directly to the RAGLibrary(...) constructor as keyword arguments. No YAML files, no environment boilerplate — just Python.

Parameter	Type	Default	Description
`llm`	`str`	`"gemini"`	LLM generator provider key
`embedder`	`str`	`"gemini"`	Embedding provider key
`vector_store`	`str`	`"chroma"`	Vector store backend key
`graph_store`	`str`	`"neo4j"`	Graph database backend key
`ocr`	`str`	`"gemini_vision"`	OCR provider for image text extraction
`rag_method`	`str`	`"hybrid_graph"`	Default retrieval strategy
`extraction`	`str`	`"hybrid"`	Entity/relation extraction mode (`nlp`, `llm`, `hybrid`)
`chunker`	`str`	`"sentence"`	Text chunking strategy
`page_index`	`str`	`"hybrid"`	Page number detection strategy
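Because every option is a plain keyword argument, a misspelled key is easy to miss. The sketch below mirrors the defaults from the table above in a dict and rejects unknown keys before anything is constructed. The names `DEFAULTS` and `build_config` are hypothetical helpers, not part of cognity-ai, and whether the constructor itself rejects unknown keys is not stated here.

```python
# Defaults copied from the parameter table above.
DEFAULTS = {
    "llm": "gemini",
    "embedder": "gemini",
    "vector_store": "chroma",
    "graph_store": "neo4j",
    "ocr": "gemini_vision",
    "rag_method": "hybrid_graph",
    "extraction": "hybrid",
    "chunker": "sentence",
    "page_index": "hybrid",
}


def build_config(**overrides: str) -> dict[str, str]:
    """Merge overrides into the defaults, failing fast on unknown keys."""
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise TypeError(f"Unknown configuration keys: {', '.join(unknown)}")
    return {**DEFAULTS, **overrides}


config = build_config(llm="openai", embedder="openai", vector_store="qdrant")
print(config["llm"], config["graph_store"])
```

The resulting dict could then be unpacked into the constructor as RAGLibrary(**config).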

Configuration examples

python

from cognity_ai import RAGLibrary

# Default stack — Gemini + Neo4j + ChromaDB
rag = RAGLibrary()

# OpenAI stack
rag = RAGLibrary(
    llm="openai",
    embedder="openai",
    vector_store="qdrant",
    graph_store="neo4j",
)

# Fully local / offline stack
rag = RAGLibrary(
    llm="ollama",
    embedder="sentence_transformers",
    vector_store="faiss",
    graph_store="networkx",
    ocr="tesseract",
    rag_method="vector_only",
)

# Anthropic + Bedrock embeddings
rag = RAGLibrary(
    llm="anthropic",
    embedder="bedrock",
    vector_store="pinecone",
    graph_store="neo4j",
)

Your First RAG Pipeline

Five steps from zero to a working, production-ready RAG pipeline.

Import & Initialize

Create a RAGLibrary instance with your preferred providers.

python

from cognity_ai import RAGLibrary

rag = RAGLibrary()  # uses smart defaults

Ingest Documents

Load individual files or recursively ingest an entire directory.

python

# Individual files — any supported format
rag.ingest("report.pdf")
rag.ingest("data.xlsx")
rag.ingest("notes.docx")
rag.ingest("diagram.png")   # OCR'd automatically

# Batch ingest an entire directory (recursive)
rag.ingest_dir("./knowledge-base")

Build Communities (optional but recommended)

Community detection groups related entities into clusters, enabling global summarization queries and dramatically improving recall on broad questions.

python

rag.build_communities()  # Leiden algorithm via GDS

Query

Ask natural-language questions. The default hybrid_graph method fuses four retrieval channels into a single ranked result.

python

answer = rag.query("What are the main revenue drivers in Q3?")
print(answer)

# Per-query method override
answer = rag.query("Summarize all themes", method="multi_query")
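How hybrid_graph combines its channels is not specified here. One common way to merge several ranked lists is reciprocal rank fusion (RRF), sketched below as an illustration only. The function name, the channel labels in the comments, and the constant k=60 are assumptions, not cognity-ai internals.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk IDs: each list adds 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Four hypothetical channels, each returning chunk IDs best-first.
channels = [
    ["c1", "c2", "c3"],  # e.g. vector similarity
    ["c2", "c1"],        # e.g. graph traversal
    ["c2", "c4"],        # e.g. community summaries
    ["c3", "c2"],        # e.g. keyword match
]
fused = reciprocal_rank_fusion(channels)
print([chunk_id for chunk_id, _ in fused])
```

A chunk that appears near the top of several channels (c2 here) outranks one that a single channel ranks first, which is the usual reason for fusing channels instead of picking one.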

Retrieve with Sources

Get structured results with source metadata, page numbers, confidence scores, and relevance ranks.

python

result = rag.retrieve("key financial metrics", top_k=5)

for chunk in result.chunks:
    print(chunk.text)
    print(f"Source: {chunk.source} | Page: {chunk.page} | Score: {chunk.score:.3f}")
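The per-chunk metadata makes it straightforward to build a citation list for an answer. The sketch below uses a stand-in `Chunk` dataclass with the four attributes shown above (text, source, page, score); the real result type may carry more fields, and `format_citations` is a hypothetical helper with made-up sample data.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Stand-in for a retrieved chunk; only the attributes used above."""
    text: str
    source: str
    page: int
    score: float


def format_citations(chunks: list[Chunk], min_score: float = 0.0) -> list[str]:
    """Return unique 'source, p. N' strings, best-scoring chunk first."""
    seen: set[tuple[str, int]] = set()
    citations: list[str] = []
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        key = (chunk.source, chunk.page)
        if chunk.score < min_score or key in seen:
            continue
        seen.add(key)
        citations.append(f"{chunk.source}, p. {chunk.page}")
    return citations


# Illustrative data only.
sample = [
    Chunk("Revenue grew.", "report.pdf", 3, 0.91),
    Chunk("Margins were flat.", "report.pdf", 3, 0.74),
    Chunk("Headcount table.", "data.xlsx", 1, 0.40),
]
print(format_citations(sample, min_score=0.5))
```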

File Format Support

cognity-ai ships with loaders for the file formats listed below out of the box.

Format	Extensions	Loader	Key Features
Plain Text	`.txt`, `.md`	`TextLoader`	UTF-8, encoding detection, Markdown stripping
PDF	`.pdf`	`PDFLoader`	Text, tables, embedded images, page boundaries, OCR fallback
Word	`.docx`, `.doc`	`DocxLoader`	Paragraphs, tables, headers, embedded images auto-OCR'd
Excel	`.xlsx`, `.xls`	`ExcelLoader`	Multi-sheet, cell types, named ranges, formula values
CSV / TSV	`.csv`, `.tsv`	`CSVLoader`	Dialect detection, header inference, chunking by row count
PowerPoint	`.pptx`, `.ppt`	`PPTXLoader`	Slide text, speaker notes, embedded images
HTML	`.html`, `.htm`	`HTMLLoader`	Tag stripping, link extraction, heading hierarchy
JSON	`.json`	`JSONLoader`	Nested object flattening, JSON Lines, key-path labels
YAML	`.yaml`, `.yml`	`JSONLoader`	Parsed to dict, same flattening as JSON
Images	`.jpg`, `.png`, `.webp`, `.tiff`, `.bmp`	`ImageLoader`	Full OCR, multi-page TIFF, optional base64 embedding

💡

Embedded images: Images embedded inside DOCX, PDF, and PPTX files are automatically extracted, OCR'd with your configured provider, and injected back into the text at the correct character offset. No manual pre-processing required.

OCR Configuration

cognity-ai uses a fallback chain for OCR. If the primary provider fails (missing key, rate limit, unsupported format), it automatically tries the next provider in the chain.

Provider	Key	Method	Best For	Install
Gemini Vision	`"gemini_vision"`	Multimodal LLM	Complex tables, mixed layouts	`google-generativeai`
OpenAI Vision	`"openai_vision"`	GPT-4o Vision	General purpose, high accuracy	`openai`
Claude Vision	`"anthropic_vision"`	Claude multimodal	Dense documents, reasoning	`anthropic`
Azure Vision	`"azure_vision"`	Azure AI Vision	Enterprise, compliance	`azure-ai-vision`
Bedrock Vision	`"bedrock_vision"`	Claude via Bedrock	AWS-native environments	`boto3`
Tesseract	`"tesseract"`	Local OCR engine	Offline, no API cost	`pytesseract`

python

# Set OCR provider at construction time
rag = RAGLibrary(ocr="openai_vision")

# Or override per-ingest for a specific file
rag.ingest("scanned_report.pdf", ocr="tesseract")
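The fallback behaviour described at the top of this section can be pictured as a loop over providers that moves on whenever one raises. The sketch below shows that pattern with stub providers. The function `ocr_with_fallback` and both stubs are hypothetical; the order of the real chain and how it is configured are not covered here.

```python
from collections.abc import Callable

OcrFn = Callable[[bytes], str]


def ocr_with_fallback(
    image: bytes, providers: list[tuple[str, OcrFn]]
) -> tuple[str, str]:
    """Try each (name, function) pair in order; return the first success."""
    errors: list[str] = []
    for name, run in providers:
        try:
            return name, run(image)
        except Exception as exc:  # missing key, rate limit, unsupported format
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All OCR providers failed: " + "; ".join(errors))


def failing_provider(image: bytes) -> str:
    """Stub that simulates a cloud provider without credentials."""
    raise PermissionError("API key not set")


def local_provider(image: bytes) -> str:
    """Stub that simulates a local engine that always succeeds."""
    return "recognized text"


used, text = ocr_with_fallback(
    b"fake image bytes",
    [("gemini_vision", failing_provider), ("tesseract", local_provider)],
)
print(used, text)
```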

Knowledge Graph

During ingestion, cognity-ai extracts entities and relationships from every document and builds a knowledge graph. The graph enables structured traversal and global summarization that pure vector search cannot achieve.

Entity & Relation Extraction

The HybridExtractor runs spaCy NLP first (fast, local, free) and uses LLM augmentation only for semantic gaps that NLP misses — causal links, temporal dependencies, implicit associations.
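The idea of a cheap first pass with an expensive fallback can be sketched without spaCy or an LLM. In the example below a regular expression stands in for the NLP pass and a stub function stands in for the LLM call, so it illustrates only the control flow, not the HybridExtractor itself. All names are hypothetical.

```python
import re
from collections.abc import Callable

Triple = tuple[str, str, str]

# Stand-in for the fast local pass: matches "<Subject> <verb> <Object>."
PATTERN = re.compile(
    r"^(?P<subj>[A-Z]\w+) (?P<rel>acquired|owns|founded) (?P<obj>[A-Z]\w+)\.?$"
)


def rule_based_extract(sentence: str) -> list[Triple]:
    match = PATTERN.match(sentence.strip())
    return [(match["subj"], match["rel"], match["obj"])] if match else []


def hybrid_extract(
    sentences: list[str], llm_extract: Callable[[str], list[Triple]]
) -> tuple[list[Triple], int]:
    """Run the cheap pass first; call the LLM only where it found nothing."""
    triples: list[Triple] = []
    llm_calls = 0
    for sentence in sentences:
        found = rule_based_extract(sentence)
        if not found:
            found = llm_extract(sentence)
            llm_calls += 1
        triples.extend(found)
    return triples, llm_calls


def stub_llm(sentence: str) -> list[Triple]:
    """Stub for an LLM call; returns a fixed causal triple."""
    return [("pricing", "caused", "revenue decline")]


triples, llm_calls = hybrid_extract(
    ["Acme acquired Globex.", "Revenue fell because pricing changed."],
    stub_llm,
)
print(triples, llm_calls)
```

Only the second sentence reaches the stub, which is the cost-saving property the hybrid mode is described as having.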

Community Detection

Calling build_communities() runs the Leiden algorithm via Neo4j GDS to cluster related entities. Community summaries are stored and used as a high-level retrieval channel — crucial for broad "summarize everything about X" queries.

python

# Build entity communities after ingestion
rag.build_communities()

# Check graph health — entity counts, orphans, density
report = rag.health_report()
print(report)

# Detect conflicting facts in the knowledge graph
conflicts = rag.detect_conflicts()
for c in conflicts:
    print(f"Conflict: {c.description} (confidence: {c.confidence:.2f})")

Incremental Updates

cognity-ai computes a SHA-256 hash of every ingested file. On re-ingest, unchanged files are skipped with zero API calls. Only modified or new files are processed.

python

# Re-ingest a directory — only changed files are processed
rag.ingest_dir("./knowledge-base")

# Confirm a fact — promotes it to higher confidence
rag.confirm(entity_id="entity_123")

# Deprecate outdated knowledge — marks it as superseded
rag.deprecate(source="old_report_2023.pdf")

# Prune all deprecated or low-confidence knowledge
rag.prune(min_confidence=0.5)

⚠️

Prune is destructive: rag.prune() permanently removes nodes and vectors from the stores. Run rag.health_report() first to review what will be removed.
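The hash-based change detection described at the start of this section can be sketched with the standard library alone. The example below keeps a path-to-digest dict in memory and reports only new or modified files. Where cognity-ai actually stores its hashes is not stated here, so the `seen` dict and both function names are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in blocks so large files stay out of memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def changed_files(paths: list[Path], seen: dict[str, str]) -> list[Path]:
    """Return paths whose digest differs from `seen`, updating `seen` in place."""
    changed: list[Path] = []
    for path in paths:
        digest = file_digest(path)
        if seen.get(str(path)) != digest:
            seen[str(path)] = digest
            changed.append(path)
    return changed


with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "notes.txt"
    doc.write_text("v1")
    seen: dict[str, str] = {}
    first = changed_files([doc], seen)   # new file: processed
    second = changed_files([doc], seen)  # unchanged: skipped
    doc.write_text("v2")
    third = changed_files([doc], seen)   # modified: processed again
    print(len(first), len(second), len(third))
```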

Plugin System

Every component in cognity-ai is an ABC registered in the PluginRegistry. You can register custom loaders, embedders, chunkers, generators, and retrievers by string key — your code drops into the pipeline without touching any core files.

python

from cognity_ai.loaders.base import BaseLoader
from cognity_ai.registry import PluginRegistry
from cognity_ai.models.document import Document

class MyNotionLoader(BaseLoader):
    """Load pages from Notion via API."""

    def load(self, source: str) -> list[Document]:
        # Fetch from Notion API ...
        page_text = ""  # placeholder: replace with the text fetched from Notion
        return [Document(text=page_text, source=source)]

# Register with a string key and supported extension
PluginRegistry.register_loader("notion", ".notion", MyNotionLoader)

# Now use it like any built-in loader
rag.ingest("page_id.notion")

ℹ️

Registering other components: Use PluginRegistry.register_embedder(), register_retriever(), register_chunker(), and register_generator() for the other component types. All follow the same pattern.
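For readers curious how a string-keyed registry of ABC subclasses can work, here is a minimal self-contained sketch. `SketchLoader` and `SketchRegistry` are hypothetical stand-ins written for this example; they are not the real BaseLoader or PluginRegistry, whose internals are not described here.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class SketchLoader(ABC):
    """Stand-in loader base class (not cognity-ai's BaseLoader)."""

    @abstractmethod
    def load(self, source: str) -> list[str]:
        """Return the text units extracted from `source`."""


class SketchRegistry:
    """Maps string keys and file extensions to loader classes."""

    def __init__(self) -> None:
        self._loaders: dict[str, type[SketchLoader]] = {}
        self._key_by_extension: dict[str, str] = {}

    def register_loader(
        self, key: str, extension: str, loader_cls: type[SketchLoader]
    ) -> None:
        if not issubclass(loader_cls, SketchLoader):
            raise TypeError(f"{loader_cls.__name__} must subclass SketchLoader")
        self._loaders[key] = loader_cls
        self._key_by_extension[extension.lower()] = key

    def loader_for(self, path: str) -> SketchLoader:
        extension = Path(path).suffix.lower()
        try:
            key = self._key_by_extension[extension]
        except KeyError:
            raise LookupError(f"No loader registered for {extension!r}") from None
        return self._loaders[key]()


class UppercaseLoader(SketchLoader):
    """Toy loader that returns the source string in upper case."""

    def load(self, source: str) -> list[str]:
        return [source.upper()]


registry = SketchRegistry()
registry.register_loader("upper", ".up", UppercaseLoader)
print(registry.loader_for("page_id.up").load("page_id.up"))
```

Keying on both a name and an extension lets the pipeline pick a loader from a file path alone, while the ABC check catches classes that do not implement the expected interface at registration time.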

Next Steps

You have the basics. Here's where to go next.

API Reference

Architecture

Multimodal RAG

Changelog