BaseChunker ABC

All chunkers implement BaseChunker. You can use the string key when constructing RAGLibrary, or import and instantiate a chunker directly.

from abc import ABC, abstractmethod
from cognity_ai.models import SemanticChunk, PageInfo

class BaseChunker(ABC):
    @abstractmethod
    def chunk(
        self,
        text: str,
        doc_id: str,
        pages: list[PageInfo] | None = None,
    ) -> list[SemanticChunk]:
        """
        Split text into chunks.

        Args:
            text:    Full document text.
            doc_id:  Source document identifier.
            pages:   Page boundary metadata from HybridPageIndex (optional).

        Returns:
            list[SemanticChunk] — ordered chunks with metadata.
        """
        ...

SemanticChunk Model

Every chunker returns a list of SemanticChunk dataclass instances:

from dataclasses import dataclass

# Defined in cognity_ai.models; reproduced here for reference.

@dataclass
class SemanticChunk:
    id: str                      # "{doc_id}_chunk_{n}"
    doc_id: str                  # parent document ID
    text: str                    # chunk text
    embedding: list[float] | None  # populated after embed_batch()
    metadata: dict               # page_number, section, char_start, char_end
    entities: list[str]          # entity names extracted from this chunk
    is_parent: bool              # True for parent chunks (parent_child strategy)
    parent_id: str | None        # set on child chunks pointing to their parent

All Strategies

sentence SentenceChunker DEFAULT

Splits on sentence boundaries detected by spaCy, then groups sentences into chunks within a token budget. Preserves complete sentences — no mid-sentence cuts. Best balance of quality and speed for most text.

  • max_tokens=512
  • overlap_sentences=1

from cognity_ai.chunkers import SentenceChunker

chunker = SentenceChunker(max_tokens=512, overlap_sentences=1)
chunks = chunker.chunk(text, doc_id="my_doc", pages=pages)
print(chunks[0].text)        # complete sentence(s)
print(chunks[0].metadata)    # {"page_number": 1, "char_start": 0, ...}
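To make the grouping step concrete, here is a rough sketch of packing whole sentences into a token budget with sentence overlap. It is not the library's implementation: a regex splitter stands in for spaCy, whitespace words stand in for tokens, and the function names group_sentences and n_tokens are made up for illustration.

```python
import re


def n_tokens(s: str) -> int:
    # Whitespace word count, a rough stand-in for a real tokenizer.
    return len(s.split())


def group_sentences(text: str, max_tokens: int = 512, overlap_sentences: int = 1) -> list[str]:
    """Pack whole sentences into chunks under a token budget.

    A single sentence longer than the budget is kept whole rather than cut.
    """
    # Naive regex splitter standing in for spaCy's sentence segmentation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    fresh = 0  # sentences in `current` that are new, not carried-over overlap
    for sent in sentences:
        if fresh and n_tokens(" ".join(current)) + n_tokens(sent) > max_tokens:
            chunks.append(" ".join(current))
            # Carry the last few sentences into the next chunk as overlap.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            fresh = 0
        if not fresh and n_tokens(" ".join(current)) + n_tokens(sent) > max_tokens:
            current = []  # the overlap itself would overflow, so drop it
        current.append(sent)
        fresh += 1
    if fresh:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
chunks = group_sentences(sample, max_tokens=6, overlap_sentences=1)
```

With a budget of 6 words and one sentence of overlap, each chunk holds two sentences and repeats the last sentence of the previous chunk.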

fixed FixedChunker

Splits text into fixed-size token windows with a configurable overlap. Fast and predictable. Chunks may cut mid-sentence. Good for structured or tabular text where sentence boundaries are less meaningful.

  • chunk_size=512
  • overlap=64

from cognity_ai.chunkers import FixedChunker

chunker = FixedChunker(chunk_size=512, overlap=64)
chunks = chunker.chunk(text, doc_id="my_doc")
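The windowing arithmetic behind chunk_size and overlap can be sketched as follows. This is an illustration under assumptions, not FixedChunker's actual code: it works on a pre-tokenized list, and the function name fixed_windows is hypothetical.

```python
def fixed_windows(tokens: list[str], chunk_size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Slice tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this many tokens after the last
    windows: list[list[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return windows


windows = fixed_windows(list("abcdefghij"), chunk_size=4, overlap=1)
```

Each window starts chunk_size - overlap tokens after the previous one, so with the defaults above a new window would begin every 448 tokens.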

semantic SemanticChunker

Groups consecutive sentences by embedding cosine similarity. A new chunk starts when similarity drops below the threshold. Produces topically coherent chunks but requires an embedder during ingestion, which makes it slower than the sentence and fixed strategies.

  • similarity_threshold=0.8
  • max_tokens=1024

from cognity_ai.chunkers import SemanticChunker
from cognity_ai.embedders import GeminiEmbedder

chunker = SemanticChunker(
    embedder=GeminiEmbedder(api_key="..."),
    similarity_threshold=0.8,
    max_tokens=1024,
)
chunks = chunker.chunk(text, doc_id="my_doc")
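The threshold rule can be illustrated with a toy example. The sketch below assumes each sentence is compared with the one immediately before it (the library may compare against a running chunk embedding instead), uses word-count vectors in place of a real embedder, and leaves out the max_tokens cap. The names toy_embed, cosine, and group_by_similarity are hypothetical.

```python
import math
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    # Word-count vector, a toy stand-in for a real embedding model.
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def group_by_similarity(sentences: list[str], similarity_threshold: float = 0.8) -> list[list[str]]:
    """Start a new group whenever similarity to the previous sentence drops below the threshold."""
    groups: list[list[str]] = []
    for sent in sentences:
        if groups and cosine(toy_embed(groups[-1][-1]), toy_embed(sent)) >= similarity_threshold:
            groups[-1].append(sent)
        else:
            groups.append([sent])
    return groups


groups = group_by_similarity(
    ["cats chase mice", "cats chase birds", "stocks fell sharply"],
    similarity_threshold=0.5,
)
```

The first two sentences share two of three words (cosine similarity about 0.67), so they stay together; the third shares none and opens a new group.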

recursive RecursiveChunker

Recursively splits on a priority list of separators ("\n\n", "\n", ". ", " ") until chunks fit within the token budget. Inspired by LangChain's recursive splitter. Good for markdown and code.

  • chunk_size=512
  • overlap=64
  • separators=["\n\n", "\n", ". ", " "]

from cognity_ai.chunkers import RecursiveChunker

chunker = RecursiveChunker(
    chunk_size=512,
    overlap=64,
    separators=["\n\n", "\n", ". ", " "],
)
chunks = chunker.chunk(text, doc_id="my_doc")
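The recursion over the separator list might look roughly like this. It is a simplified sketch, not RecursiveChunker's code: it counts whitespace words instead of tokens, ignores the overlap parameter, and the name recursive_split is hypothetical. Pieces are split on the coarsest separator first, oversized pieces are split again with the next separator, and neighbours are greedily merged back while they fit the budget.

```python
def n_tokens(s: str) -> int:
    # Whitespace word count, a rough stand-in for a real tokenizer.
    return len(s.split())


def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the first separator, recursing into pieces that are still too long."""
    if n_tokens(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to hard word windows.
        words = text.split()
        return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    head, *rest = separators
    chunks: list[str] = []
    for part in text.split(head):
        for piece in recursive_split(part, chunk_size, rest):
            # Greedily merge with the previous chunk while the budget allows.
            if chunks and n_tokens(chunks[-1]) + n_tokens(piece) <= chunk_size:
                chunks[-1] = chunks[-1] + head + piece
            else:
                chunks.append(piece)
    return chunks


text = "alpha beta\n\ngamma delta epsilon zeta\n\neta"
pieces = recursive_split(text, chunk_size=3, separators=["\n\n", "\n", ". ", " "])
```

Here the middle paragraph has four words, so it falls through to the " " separator and is cut after three; the leftover word then merges with the final paragraph because the two fit the budget together.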

parent_child ParentChildChunker

Produces two levels of chunks: small child chunks for retrieval (precise similarity matching) and large parent chunks returned as context to the LLM. Use with ParentChildRetriever. Best for long documents where you need precise retrieval but rich context.

  • child_size=256
  • parent_size=1024
  • child_overlap=32

from cognity_ai import RAGLibrary

# Must pair chunker + retriever
rag = RAGLibrary(chunker="parent_child", rag_method="parent_child")
rag.ingest("long_report.pdf")
answer = rag.query("What were the Q3 results?")
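The two-level layout can be sketched as below. This is an assumption-laden illustration rather than ParentChildChunker itself: it works on a pre-tokenized list, returns plain dicts instead of SemanticChunk objects, and assumes parents and children share one running counter for the "{doc_id}_chunk_{n}" ids. The function name parent_child_chunks is hypothetical.

```python
def parent_child_chunks(
    tokens: list[str],
    doc_id: str,
    parent_size: int = 1024,
    child_size: int = 256,
    child_overlap: int = 32,
) -> list[dict]:
    """Emit each parent window followed by the overlapping child windows cut from it."""
    if not 0 <= child_overlap < child_size:
        raise ValueError("child_overlap must be >= 0 and smaller than child_size")
    step = child_size - child_overlap
    records: list[dict] = []
    n = 0  # shared counter for chunk ids (an assumption of this sketch)
    for p_start in range(0, len(tokens), parent_size):
        parent_tokens = tokens[p_start:p_start + parent_size]
        parent_id = f"{doc_id}_chunk_{n}"
        records.append({"id": parent_id, "text": " ".join(parent_tokens),
                        "is_parent": True, "parent_id": None})
        n += 1
        for c_start in range(0, len(parent_tokens), step):
            child_tokens = parent_tokens[c_start:c_start + child_size]
            records.append({"id": f"{doc_id}_chunk_{n}", "text": " ".join(child_tokens),
                            "is_parent": False, "parent_id": parent_id})
            n += 1
            if c_start + child_size >= len(parent_tokens):
                break  # this child already reaches the end of the parent
    return records


records = parent_child_chunks(
    [str(i) for i in range(10)], "d", parent_size=6, child_size=4, child_overlap=2
)
```

At query time, a retriever would match against the child texts and then return the parent identified by parent_id as context.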

hybrid HybridChunker

Two-pass approach: sentence-boundary primary chunking followed by a semantic similarity regrouping pass. Merges adjacent chunks that are topically similar and splits those that diverge. Best quality; moderate speed overhead.

  • max_tokens=512
  • similarity_threshold=0.75
  • overlap_sentences=1

from cognity_ai import RAGLibrary

rag = RAGLibrary(chunker="hybrid", embedder="openai", openai_api_key="sk-...")
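As a rough picture of the second pass, the sketch below shows only the merge half: adjacent first-pass chunks are joined when they are similar enough and the result still fits the token budget. The splitting of divergent chunks is left out. Word-count vectors stand in for real embeddings, and the names similarity and merge_pass are hypothetical, not HybridChunker's API.

```python
import math
from collections import Counter


def similarity(a: str, b: str) -> float:
    """Cosine similarity of word-count vectors (toy stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values()) * sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def merge_pass(chunks: list[str], max_tokens: int = 512, similarity_threshold: float = 0.75) -> list[str]:
    """Merge each chunk into its predecessor when similar enough and within budget."""
    merged: list[str] = []
    for chunk in chunks:
        if (
            merged
            and similarity(merged[-1], chunk) >= similarity_threshold
            and len(merged[-1].split()) + len(chunk.split()) <= max_tokens
        ):
            merged[-1] = f"{merged[-1]} {chunk}"
        else:
            merged.append(chunk)
    return merged


first_pass = [
    "red apples taste sweet",
    "green apples taste sweet",
    "the server crashed twice",
]
merged = merge_pass(first_pass, max_tokens=512, similarity_threshold=0.7)
```

The budget check matters: with max_tokens=6 the two apple chunks would stay separate even though they are similar, because their combined length is eight words.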

Comparison Table

Key          | Retrieval Quality | Context Coherence | Speed                         | Best For
-------------+-------------------+-------------------+-------------------------------+---------------------------------------------------------
sentence     | Excellent         | Excellent         | Fast                          | General purpose, default choice
fixed        | Good              | Fair              | Fastest                       | Structured/tabular text, CSV, JSON
semantic     | Excellent         | Excellent         | Slow (embedding per sentence) | High-quality ingestion, topic-dense corpora
recursive    | Good              | Good              | Fast                          | Markdown, code files, hierarchical text
parent_child | Excellent         | Excellent         | Moderate                      | Long documents needing precise retrieval + broad context
hybrid       | Best              | Best              | Moderate–slow                 | Quality-critical pipelines with budget for two passes

Configuration

Select a chunker by passing its string key to RAGLibrary, through LibraryConfig, or by instantiating the chunker class directly:

from cognity_ai import RAGLibrary
from cognity_ai.config import LibraryConfig

# Via string key (simplest)
rag = RAGLibrary(chunker="sentence")
rag = RAGLibrary(chunker="parent_child", rag_method="parent_child")

# Via LibraryConfig (full control)
config = LibraryConfig(chunker="hybrid")
rag = RAGLibrary(config=config)

# Direct instantiation (advanced)
from cognity_ai.chunkers import HybridChunker
from cognity_ai.embedders import OpenAIEmbedder

chunker = HybridChunker(
    embedder=OpenAIEmbedder(api_key="sk-..."),
    max_tokens=768,
    similarity_threshold=0.78,
)
# `document` is an already-loaded document exposing .text and .page_map
chunks = chunker.chunk(document.text, doc_id="doc_1", pages=document.page_map)

Tip: The semantic and hybrid chunkers require an embedder during chunking (not just during retrieval). Make sure the embedder is configured before calling ingest() with these strategies.

Custom Chunker

Implement BaseChunker and register it under a string key:

from cognity_ai.chunkers.base import BaseChunker
from cognity_ai.models import SemanticChunk, PageInfo
from cognity_ai import RAGLibrary

class ParagraphChunker(BaseChunker):
    """Split on double newlines: simple paragraph-based chunking."""

    def chunk(
        self,
        text: str,
        doc_id: str,
        pages: list[PageInfo] | None = None,
    ) -> list[SemanticChunk]:
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks: list[SemanticChunk] = []
        cursor = 0  # search position, so repeated paragraphs get distinct offsets
        for i, para in enumerate(paragraphs):
            start = text.index(para, cursor)
            cursor = start + len(para)
            chunks.append(
                SemanticChunk(
                    id=f"{doc_id}_chunk_{i}",
                    doc_id=doc_id,
                    text=para,
                    embedding=None,
                    metadata={"char_start": start, "char_end": cursor},
                    entities=[],
                    is_parent=False,
                    parent_id=None,
                )
            )
        return chunks

# Register and use
rag = RAGLibrary()
rag.register_chunker("paragraph", ParagraphChunker)

# Or select the registered key via LibraryConfig
from cognity_ai.config import LibraryConfig
config = LibraryConfig(chunker="paragraph")
rag = RAGLibrary(config=config)
rag.ingest("document.txt")