Getting Started

Install cognity-ai, configure your providers, and run your first RAG query.

Installation

cognity-ai requires Python 3.11+. Choose an install profile that matches your stack, or install everything at once.

bash
# Gemini + Neo4j + ChromaDB + spaCy + all file loaders
pip install -e ".[default]"
bash
# OpenAI GPT + Qdrant + Neo4j + NLP + PDF + Office
pip install "cognity-ai[openai,qdrant,neo4j,nlp,pdf,office]"
bash
# SentenceTransformers + FAISS + NetworkX — no cloud required
pip install "cognity-ai[sentence-transformers,faiss,networkx,nlp,pdf]"
bash
# All providers, all stores, all formats
pip install -e ".[all]"
ℹ️ NLP model: After installing any profile that includes nlp, download the spaCy English model:

bash
python -m spacy download en_core_web_sm

Configuration

All configuration is passed directly to the RAGLibrary(...) constructor as keyword arguments. No YAML files, no environment boilerplate — just Python.

Parameter     Type  Default          Description
llm           str   "gemini"         LLM generator provider key
embedder      str   "gemini"         Embedding provider key
vector_store  str   "chroma"         Vector store backend key
graph_store   str   "neo4j"          Graph database backend key
ocr           str   "gemini_vision"  OCR provider for image text extraction
rag_method    str   "hybrid_graph"   Default retrieval strategy
extraction    str   "hybrid"         Entity/relation extraction mode (nlp, llm, hybrid)
chunker       str   "sentence"       Text chunking strategy
page_index    str   "hybrid"         Page number detection strategy

Configuration examples

python
from cognity_ai import RAGLibrary

# Default stack — Gemini + Neo4j + ChromaDB
rag = RAGLibrary()

# OpenAI stack
rag = RAGLibrary(
    llm="openai",
    embedder="openai",
    vector_store="qdrant",
    graph_store="neo4j",
)

# Fully local / offline stack
rag = RAGLibrary(
    llm="ollama",
    embedder="sentence_transformers",
    vector_store="faiss",
    graph_store="networkx",
    ocr="tesseract",
    rag_method="vector_only",
)

# Anthropic + Bedrock embeddings
rag = RAGLibrary(
    llm="anthropic",
    embedder="bedrock",
    vector_store="pinecone",
    graph_store="neo4j",
)

Your First RAG Pipeline

Five steps take you from a fresh install to a working RAG pipeline.

  1. Import & Initialize

    Create a RAGLibrary instance with your preferred providers.

    python
    from cognity_ai import RAGLibrary
    
    rag = RAGLibrary()  # uses smart defaults
  2. Ingest Documents

    Load individual files or recursively ingest an entire directory.

    python
    # Individual files — any supported format
    rag.ingest("report.pdf")
    rag.ingest("data.xlsx")
    rag.ingest("notes.docx")
    rag.ingest("diagram.png")   # OCR'd automatically
    
    # Batch ingest an entire directory (recursive)
    rag.ingest_dir("./knowledge-base")
  3. Build Communities (optional but recommended)

    Community detection groups related entities into clusters. This enables global summarization queries and can improve recall on broad questions.

    python
    rag.build_communities()  # Leiden algorithm via GDS
  4. Query

    Ask natural-language questions. The default hybrid_graph method fuses four retrieval channels.

    python
    answer = rag.query("What are the main revenue drivers in Q3?")
    print(answer)
    
    # Per-query method override
    answer = rag.query("Summarize all themes", method="multi_query")
  5. Retrieve with Sources

    Get structured results with source metadata, page numbers, confidence scores, and relevance ranks.

    python
    result = rag.retrieve("key financial metrics", top_k=5)
    
    for chunk in result.chunks:
        print(chunk.text)
        print(f"Source: {chunk.source} | Page: {chunk.page} | Score: {chunk.score:.3f}")
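
Step 4 mentions that hybrid_graph fuses several retrieval channels, but this page does not say how the fusion works. One common way to merge ranked lists is reciprocal rank fusion (RRF). The sketch below illustrates that technique only and makes no claim about cognity-ai's internals; the function name, the channel names, and the chunk IDs are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: dict[str, list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the channels that returned it,
    with rank starting at 1. Illustrative only; not cognity-ai's internals.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical channel names and chunk IDs.
channels = {
    "vector": ["c1", "c2", "c3"],
    "graph": ["c2", "c4"],
    "keyword": ["c2", "c1"],
    "community": ["c5"],
}
fused = reciprocal_rank_fusion(channels)
print([chunk_id for chunk_id, _ in fused])
```

A chunk that appears near the top of several channels (here c2) outranks one that tops a single channel, which is the usual motivation for rank-based fusion.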

File Format Support

cognity-ai ships with loaders for 14 file formats out of the box.

Format      Extensions                      Loader       Key Features
Plain Text  .txt, .md                       TextLoader   UTF-8, encoding detection, Markdown stripping
PDF         .pdf                            PDFLoader    Text, tables, embedded images, page boundaries, OCR fallback
Word        .docx, .doc                     DocxLoader   Paragraphs, tables, headers, embedded images auto-OCR'd
Excel       .xlsx, .xls                     ExcelLoader  Multi-sheet, cell types, named ranges, formula values
CSV / TSV   .csv, .tsv                      CSVLoader    Dialect detection, header inference, chunking by row count
PowerPoint  .pptx, .ppt                     PPTXLoader   Slide text, speaker notes, embedded images
HTML        .html, .htm                     HTMLLoader   Tag stripping, link extraction, heading hierarchy
JSON        .json                           JSONLoader   Nested object flattening, JSON Lines, key-path labels
YAML        .yaml, .yml                     JSONLoader   Parsed to dict, same flattening as JSON
Images      .jpg, .png, .webp, .tiff, .bmp  ImageLoader  Full OCR, multi-page TIFF, optional base64 embedding
💡 Embedded Images: Images embedded inside DOCX, PDF, and PPTX files are automatically extracted, OCR'd with your configured provider, and injected back into the text at the correct character offset. No manual pre-processing required.
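
To make "injected back into the text at the correct character offset" concrete, here is a minimal sketch of offset-based insertion. It is an illustration under stated assumptions, not the library's code: the function name inject_ocr_text and the "[image: ...]" marker format are hypothetical, and offsets are assumed to refer to positions in the original text.

```python
def inject_ocr_text(text: str, ocr_results: list[tuple[int, str]]) -> str:
    """Insert OCR'd image text into `text` at the given character offsets.

    `ocr_results` holds (offset, ocr_text) pairs, where each offset refers to
    a position in the original, unmodified text.
    """
    pieces: list[str] = []
    cursor = 0
    # Walk the offsets in ascending order so each slice of the original
    # text is copied exactly once.
    for offset, ocr_text in sorted(ocr_results):
        if not 0 <= offset <= len(text):
            raise ValueError(f"offset {offset} is outside the text")
        pieces.append(text[cursor:offset])
        pieces.append(f"[image: {ocr_text}]")  # hypothetical marker format
        cursor = offset
    pieces.append(text[cursor:])
    return "".join(pieces)


body = "Revenue grew. See chart. Costs fell."
merged = inject_ocr_text(body, [(24, "Q3 bar chart"), (13, "logo")])
print(merged)
```

Building the result from slices avoids the offset drift that would occur if each insertion shifted the positions of the ones after it.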

OCR Configuration

cognity-ai uses a fallback chain for OCR. If the primary provider fails (missing key, rate limit, unsupported format), it automatically tries the next provider in the chain.

Provider        Key                 Method              Best For                        Install
Gemini Vision   "gemini_vision"     Multimodal LLM      Complex tables, mixed layouts   google-generativeai
OpenAI Vision   "openai_vision"     GPT-4o Vision       General purpose, high accuracy  openai
Claude Vision   "anthropic_vision"  Claude multimodal   Dense documents, reasoning      anthropic
Azure Vision    "azure_vision"      Azure AI Vision     Enterprise, compliance          azure-ai-vision
Bedrock Vision  "bedrock_vision"    Claude via Bedrock  AWS-native environments         boto3
Tesseract       "tesseract"         Local OCR engine    Offline, no API cost            pytesseract
python
# Set OCR provider at construction time
rag = RAGLibrary(ocr="openai_vision")

# Or override per-ingest for a specific file
rag.ingest("scanned_report.pdf", ocr="tesseract")
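
The fallback behaviour described above can be pictured as a loop over providers that returns the first successful result. The sketch below is a generic illustration, not cognity-ai's implementation: ocr_with_fallback and both stand-in providers are hypothetical names, and the order of the real chain is not documented on this page.

```python
from typing import Callable

OcrProvider = Callable[[bytes], str]


def ocr_with_fallback(
    image: bytes, chain: list[tuple[str, OcrProvider]]
) -> tuple[str, str]:
    """Try each provider in order; return (provider_key, text) from the first success."""
    errors: list[str] = []
    for key, provider in chain:
        try:
            return key, provider(image)
        except Exception as exc:  # e.g. missing key, rate limit, unsupported format
            errors.append(f"{key}: {exc}")
    raise RuntimeError("all OCR providers failed: " + "; ".join(errors))


# Stand-in providers for illustration; real ones would call an API or Tesseract.
def failing_cloud_ocr(image: bytes) -> str:
    raise PermissionError("missing API key")


def local_ocr(image: bytes) -> str:
    return f"<{len(image)} bytes of image text>"


used, text = ocr_with_fallback(
    b"\x89PNG",
    [("gemini_vision", failing_cloud_ocr), ("tesseract", local_ocr)],
)
print(used, text)
```

Collecting the per-provider errors means that a total failure still reports why each provider was skipped.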

Knowledge Graph

During ingestion, cognity-ai extracts entities and relationships from every document and builds a knowledge graph. The graph enables structured traversal and global summarization that pure vector search cannot achieve.

Entity & Relation Extraction

The HybridExtractor runs spaCy NLP first (fast, local, free) and uses LLM augmentation only for semantic gaps that NLP misses — causal links, temporal dependencies, implicit associations.
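
The "cheap pass first, LLM only for the gaps" idea can be sketched as follows. This is a simplified illustration, not the HybridExtractor itself: a regular expression stands in for the spaCy pass, fake_llm stands in for a real LLM call, and the gap rule used here (escalate a sentence only when the cheap pass finds nothing) is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str
    obj: str


# Stand-in for the NLP pass: only catches explicit "X acquired Y" statements.
_ACQUIRED = re.compile(r"\b([A-Z]\w+) acquired ([A-Z]\w+)")


def nlp_extract(sentence: str) -> list[Relation]:
    return [Relation(m.group(1), "acquired", m.group(2)) for m in _ACQUIRED.finditer(sentence)]


def hybrid_extract(
    sentences: list[str], llm_extract: Callable[[str], list[Relation]]
) -> tuple[list[Relation], int]:
    """Run the cheap pass on every sentence; call the LLM only where it found nothing."""
    relations: list[Relation] = []
    llm_calls = 0
    for sentence in sentences:
        found = nlp_extract(sentence)
        if not found:
            # Gap: escalate this sentence only (assumed gap rule).
            found = llm_extract(sentence)
            llm_calls += 1
        relations.extend(found)
    return relations, llm_calls


def fake_llm(sentence: str) -> list[Relation]:
    # Stand-in for an LLM call: recognises a simple "effect because cause" pattern.
    if " because " in sentence:
        effect, cause = sentence.rstrip(".").split(" because ", 1)
        return [Relation(cause, "causes", effect)]
    return []


sentences = ["Acme acquired Globex.", "Margins fell because freight costs rose."]
relations, llm_calls = hybrid_extract(sentences, fake_llm)
print(relations, llm_calls)
```

Only the second sentence reaches the LLM stand-in, which is where the cost saving of a hybrid approach comes from.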

Community Detection

Calling build_communities() runs the Leiden algorithm via Neo4j GDS to cluster related entities. Community summaries are stored and used as a high-level retrieval channel — crucial for broad "summarize everything about X" queries.

python
# Build entity communities after ingestion
rag.build_communities()

# Check graph health — entity counts, orphans, density
report = rag.health_report()
print(report)

# Detect conflicting facts in the knowledge graph
conflicts = rag.detect_conflicts()
for c in conflicts:
    print(f"Conflict: {c.description} (confidence: {c.confidence:.2f})")

Incremental Updates

cognity-ai computes a SHA-256 hash of every ingested file. On re-ingest, unchanged files are skipped with zero API calls. Only modified or new files are processed.

python
# Re-ingest a directory — only changed files are processed
rag.ingest_dir("./knowledge-base")

# Confirm a fact — promotes it to higher confidence
rag.confirm(entity_id="entity_123")

# Deprecate outdated knowledge — marks it as superseded
rag.deprecate(source="old_report_2023.pdf")

# Prune all deprecated or low-confidence knowledge
rag.prune(min_confidence=0.5)
⚠️ Prune is destructive: rag.prune() permanently removes nodes and vectors from the stores. Run rag.health_report() first to review what will be removed.
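
The hash-based skip described at the start of this section can be sketched with the standard library alone. The code below is a minimal illustration, not cognity-ai's implementation: the JSON manifest, its location, and the function names are assumptions, and where the library actually stores its hashes is not documented here.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def files_to_ingest(directory: Path, manifest_path: Path) -> list[Path]:
    """Return new or modified files and record their hashes in the manifest."""
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    changed: list[Path] = []
    for path in sorted(p for p in directory.rglob("*") if p.is_file()):
        digest = file_sha256(path)
        if manifest.get(str(path)) != digest:
            changed.append(path)
            manifest[str(path)] = digest
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return changed


with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp) / "knowledge-base"
    docs.mkdir()
    (docs / "a.txt").write_text("v1")
    manifest_file = Path(tmp) / "manifest.json"  # kept outside the scanned directory
    print(len(files_to_ingest(docs, manifest_file)))  # first run: a.txt is new
    print(len(files_to_ingest(docs, manifest_file)))  # second run: nothing changed
```

Keeping the manifest outside the scanned directory matters in this sketch. Otherwise the manifest would be hashed as a document and would look modified on every run.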

Plugin System

Every component in cognity-ai is an ABC registered in the PluginRegistry. You can register custom loaders, embedders, chunkers, generators, and retrievers by string key — your code drops into the pipeline without touching any core files.

python
from cognity_ai.loaders.base import BaseLoader
from cognity_ai.registry import PluginRegistry
from cognity_ai.models.document import Document

class MyNotionLoader(BaseLoader):
    """Load pages from Notion via API."""

    def load(self, source: str) -> list[Document]:
        # Fetch from Notion API ...
        page_text = ""  # placeholder: replace with the text fetched from Notion
        return [Document(text=page_text, source=source)]

# Register with a string key and supported extension
PluginRegistry.register_loader("notion", ".notion", MyNotionLoader)

# Now use it like any built-in loader
rag.ingest("page_id.notion")
ℹ️ Registering other components: Use PluginRegistry.register_embedder(), register_retriever(), register_chunker(), and register_generator() for the other component types. All follow the same pattern.
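
For readers new to the pattern, the sketch below shows how a registry that maps string keys to ABC subclasses can work in general. It is a self-contained illustration, not cognity-ai's PluginRegistry: the Registry class, BaseChunker, and the method signatures are hypothetical, and the real register_chunker() signature may differ.

```python
from abc import ABC, abstractmethod


class BaseChunker(ABC):
    """Hypothetical component interface."""

    @abstractmethod
    def chunk(self, text: str) -> list[str]: ...


class Registry:
    """Maps string keys to component classes (illustrative, not PluginRegistry)."""

    def __init__(self) -> None:
        self._chunkers: dict[str, type[BaseChunker]] = {}

    def register_chunker(self, key: str, cls: type[BaseChunker]) -> None:
        if not (isinstance(cls, type) and issubclass(cls, BaseChunker)):
            raise TypeError(f"{cls!r} must subclass BaseChunker")
        if key in self._chunkers:
            raise ValueError(f"chunker key {key!r} is already registered")
        self._chunkers[key] = cls

    def create_chunker(self, key: str, **kwargs) -> BaseChunker:
        try:
            cls = self._chunkers[key]
        except KeyError:
            raise KeyError(f"unknown chunker {key!r}; known: {sorted(self._chunkers)}") from None
        return cls(**kwargs)


class ParagraphChunker(BaseChunker):
    def chunk(self, text: str) -> list[str]:
        # Split on blank lines and drop empty fragments.
        return [part.strip() for part in text.split("\n\n") if part.strip()]


registry = Registry()
registry.register_chunker("paragraph", ParagraphChunker)
chunker = registry.create_chunker("paragraph")
print(chunker.chunk("First.\n\nSecond."))
```

Checking the base class at registration time surfaces a wrongly typed plugin immediately, before a pipeline run reaches it.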

Next Steps

You have the basics. Here's where to go next.