Chunkers — cognity-ai API

BaseChunker ABC

All chunkers implement BaseChunker. You can use the string key when constructing RAGLibrary, or import and instantiate a chunker directly.

from abc import ABC, abstractmethod
from cognity_ai.models import SemanticChunk, PageInfo

class BaseChunker(ABC):
    @abstractmethod
    def chunk(
        self,
        text: str,
        doc_id: str,
        pages: list[PageInfo] | None = None,
    ) -> list[SemanticChunk]:
        """
        Split text into chunks.

        Args:
            text:    Full document text.
            doc_id:  Source document identifier.
            pages:   Page boundary metadata from HybridPageIndex (optional).

        Returns:
            list[SemanticChunk] — ordered chunks with metadata.
        """
        ...

SemanticChunk Model

Every chunker returns a list of SemanticChunk dataclass instances:

# Defined in cognity_ai.models
from dataclasses import dataclass

@dataclass
class SemanticChunk:
    id: str                      # "{doc_id}_chunk_{n}"
    doc_id: str                  # parent document ID
    text: str                    # chunk text
    embedding: list[float] | None  # populated after embed_batch()
    metadata: dict               # page_number, section, char_start, char_end
    entities: list[str]          # entity names extracted from this chunk
    is_parent: bool              # True for parent chunks (parent_child strategy)
    parent_id: str | None        # set on child chunks pointing to their parent

All Strategies

sentence SentenceChunker DEFAULT

Splits on sentence boundaries detected by spaCy, then groups sentences into chunks within a token budget. Preserves complete sentences — no mid-sentence cuts. Best balance of quality and speed for most text.

Parameters: max_tokens=512, overlap_sentences=1

from cognity_ai.chunkers import SentenceChunker

chunker = SentenceChunker(max_tokens=512, overlap_sentences=1)
chunks = chunker.chunk(text, doc_id="my_doc", pages=pages)
print(chunks[0].text)        # complete sentence(s)
print(chunks[0].metadata)    # {"page_number": 1, "char_start": 0, ...}
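
The grouping step described above can be sketched in plain Python. This is a minimal illustration, not SentenceChunker's actual implementation: a regex stands in for spaCy sentence detection, whitespace word counts stand in for token counts, and the helper names (split_sentences, count_tokens, group_sentences) are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive regex splitter; an assumption standing in for spaCy.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_tokens(sentence: str) -> int:
    # Whitespace word count as a rough stand-in for a real tokenizer.
    return len(sentence.split())


def group_sentences(sentences, max_tokens=512, overlap_sentences=1):
    chunks = []
    current = []
    for sentence in sentences:
        size = sum(count_tokens(s) for s in current) + count_tokens(sentence)
        if current and size > max_tokens:
            # Close the current chunk, then carry the last N sentences forward.
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Drop the overlap if it would push the new chunk over budget.
            carried = sum(count_tokens(s) for s in current)
            if carried + count_tokens(sentence) > max_tokens:
                current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in group_sentences(split_sentences(text), max_tokens=6):
    print(chunk)
```

A single sentence longer than max_tokens still becomes its own chunk in this sketch, because sentences are never cut.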

fixed FixedChunker

Splits text into fixed-size token windows with a configurable overlap. Fast and predictable. Chunks may cut mid-sentence. Good for structured or tabular text where sentence boundaries are less meaningful.

Parameters: chunk_size=512, overlap=64

from cognity_ai.chunkers import FixedChunker

chunker = FixedChunker(chunk_size=512, overlap=64)
chunks = chunker.chunk(text, doc_id="my_doc")
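
To make the windowing concrete, here is a small sketch of fixed-size windows with overlap over a pre-tokenized list. It is an assumption about how such a splitter could work, not FixedChunker's code, and fixed_windows is a hypothetical helper.

```python
def fixed_windows(tokens, chunk_size=512, overlap=64):
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # each window starts this far after the last
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return windows


print(fixed_windows(list(range(10)), chunk_size=4, overlap=1))
```

Each window repeats the last `overlap` tokens of the previous one, which is why windows can begin or end mid-sentence.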

semantic SemanticChunker

Groups consecutive sentences by embedding cosine similarity. A new chunk starts when similarity drops below the threshold. Produces topically coherent chunks, but requires an embedder during ingestion, so it is slower than the sentence and fixed strategies.

Parameters: similarity_threshold=0.8, max_tokens=1024

from cognity_ai.chunkers import SemanticChunker
from cognity_ai.embedders import GeminiEmbedder

chunker = SemanticChunker(
    embedder=GeminiEmbedder(api_key="..."),
    similarity_threshold=0.8,
    max_tokens=1024,
)
chunks = chunker.chunk(text, doc_id="my_doc")
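
The threshold rule can be illustrated without a real embedding model. The sketch below is an assumption-laden stand-in: it uses a toy bag-of-words vector instead of an embedder, compares each sentence only with the one before it, and omits the max_tokens cap. The names embed, cosine, and semantic_groups are hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    # Toy bag-of-words vector; a real embedder returns dense floats.
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_groups(sentences, similarity_threshold=0.8):
    groups = []
    previous = None
    for sentence in sentences:
        vector = embed(sentence)
        if previous is not None and cosine(previous, vector) >= similarity_threshold:
            groups[-1].append(sentence)  # still on the same topic
        else:
            groups.append([sentence])    # similarity dropped: start a new chunk
        previous = vector
    return groups


sentences = ["cats chase mice", "cats chase birds", "stocks fell today"]
print(semantic_groups(sentences, similarity_threshold=0.5))
```

With dense embeddings, a threshold around 0.8 is stricter than it looks here, so the right value depends on the embedder in use.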

recursive RecursiveChunker

Recursively splits on a priority list of separators ("\n\n", "\n", ". ", " ") until chunks fit within the token budget. Inspired by LangChain's recursive splitter. Good for markdown and code.

Parameters: chunk_size=512, overlap=64, separators=["\n\n", "\n", ". ", " "]

from cognity_ai.chunkers import RecursiveChunker

chunker = RecursiveChunker(
    chunk_size=512,
    overlap=64,
    separators=["\n\n", "\n", ". ", " "],
)
chunks = chunker.chunk(text, doc_id="my_doc")
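
The recursion over separators can be sketched as follows. This is an illustrative simplification rather than RecursiveChunker's implementation: sizes are measured in characters instead of tokens, overlap is left out, and recursive_split is a hypothetical name.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    buffer = ""
    for piece in text.split(sep):
        candidate = buffer + sep + piece if buffer else piece
        if len(candidate) <= chunk_size:
            buffer = candidate  # keep packing pieces into the current chunk
            continue
        if buffer:
            chunks.append(buffer)
        if len(piece) > chunk_size:
            # Piece is still too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            buffer = ""
        else:
            buffer = piece
    if buffer:
        chunks.append(buffer)
    return chunks


doc = "alpha beta\n\ngamma delta epsilon\n\nzeta"
print(recursive_split(doc, 12, ["\n\n", "\n", ". ", " "]))
```

Coarse separators are tried first, so paragraph and line structure survive wherever a whole paragraph or line fits in the budget.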

parent_child ParentChildChunker

Produces two levels of chunks: small child chunks for retrieval (precise similarity matching) and large parent chunks returned as context to the LLM. Use with ParentChildRetriever. Best for long documents where you need precise retrieval but rich context.

Parameters: child_size=256, parent_size=1024, child_overlap=32

from cognity_ai import RAGLibrary

# Must pair chunker + retriever
rag = RAGLibrary(chunker="parent_child", rag_method="parent_child")
rag.ingest("long_report.pdf")
answer = rag.query("What were the Q3 results?")
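
The two-level structure can be pictured with a short sketch. It is a simplified assumption, not ParentChildChunker's code: it works on a token list, uses plain dicts that mirror the is_parent and parent_id fields, and the ID format and helper names (parent_child_split, context_for) are hypothetical.

```python
def parent_child_split(tokens, doc_id, parent_size=1024, child_size=256, child_overlap=32):
    if not 0 <= child_overlap < child_size:
        raise ValueError("child_overlap must satisfy 0 <= child_overlap < child_size")
    step = child_size - child_overlap
    records = []
    for p, p_start in enumerate(range(0, len(tokens), parent_size)):
        parent_tokens = tokens[p_start:p_start + parent_size]
        parent_id = f"{doc_id}_parent_{p}"  # hypothetical ID format
        records.append(
            {"id": parent_id, "is_parent": True, "parent_id": None, "tokens": parent_tokens}
        )
        # Children are overlapping windows cut from inside their parent.
        for c, c_start in enumerate(range(0, len(parent_tokens), step)):
            records.append(
                {
                    "id": f"{parent_id}_child_{c}",
                    "is_parent": False,
                    "parent_id": parent_id,
                    "tokens": parent_tokens[c_start:c_start + child_size],
                }
            )
            if c_start + child_size >= len(parent_tokens):
                break
    return records


def context_for(child_id, records):
    # Retrieval matches a child; the parent's tokens are what the LLM would see.
    by_id = {r["id"]: r for r in records}
    return by_id[by_id[child_id]["parent_id"]]["tokens"]


records = parent_child_split(list(range(8)), "doc", parent_size=4, child_size=2, child_overlap=1)
print(context_for("doc_parent_1_child_0", records))
```

Small children keep similarity matching precise, while the lookup back to the parent supplies the broader context.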

hybrid HybridChunker

Two-pass approach: sentence-boundary primary chunking followed by a semantic similarity regrouping pass. Merges adjacent chunks that are topically similar and splits those that diverge. Best quality; moderate speed overhead.

Parameters: max_tokens=512, similarity_threshold=0.75, overlap_sentences=1

from cognity_ai import RAGLibrary

rag = RAGLibrary(chunker="hybrid", embedder="openai", openai_api_key="sk-...")
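
The merge half of the second pass can be sketched as below. This is only an approximation of the idea: Jaccard word overlap stands in for embedding similarity, word counts stand in for tokens, the split step is omitted, and jaccard and merge_similar are hypothetical names rather than HybridChunker internals.

```python
def jaccard(a: str, b: str) -> float:
    # Word-set overlap; an assumption standing in for embedding cosine similarity.
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def merge_similar(chunks, similarity_threshold=0.75, max_tokens=512):
    merged = []
    for chunk in chunks:
        if merged:
            combined = merged[-1] + " " + chunk
            similar = jaccard(merged[-1], chunk) >= similarity_threshold
            if similar and len(combined.split()) <= max_tokens:
                merged[-1] = combined  # fold this chunk into its neighbour
                continue
        merged.append(chunk)
    return merged


first_pass = ["red green blue", "red green blue yellow", "market prices rose"]
print(merge_similar(first_pass, similarity_threshold=0.7, max_tokens=10))
```

The token budget check keeps the regrouping pass from undoing the size limit enforced by the first pass.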

Comparison Table

Key	Retrieval Quality	Context Coherence	Speed	Best For
`sentence`	Excellent	Excellent	Fast	General purpose, default choice
`fixed`	Good	Fair	Fastest	Structured/tabular text, CSV, JSON
`semantic`	Excellent	Excellent	Slow (embedding per sentence)	High-quality ingestion, topic-dense corpora
`recursive`	Good	Good	Fast	Markdown, code files, hierarchical text
`parent_child`	Excellent	Excellent	Moderate	Long documents needing precise retrieval + broad context
`hybrid`	Best	Best	Moderate–slow	Quality-critical pipelines with budget for two passes

Configuration

Set the chunker with a string key, through LibraryConfig, or by instantiating a chunker class directly:

from cognity_ai import RAGLibrary
from cognity_ai.config import LibraryConfig

# Via string key (simplest)
rag = RAGLibrary(chunker="sentence")
rag = RAGLibrary(chunker="parent_child", rag_method="parent_child")

# Via LibraryConfig (full control)
config = LibraryConfig(chunker="hybrid")
rag = RAGLibrary(config=config)

# Direct instantiation (advanced)
from cognity_ai.chunkers import HybridChunker
from cognity_ai.embedders import OpenAIEmbedder

chunker = HybridChunker(
    embedder=OpenAIEmbedder(api_key="sk-..."),
    max_tokens=768,
    similarity_threshold=0.78,
)
chunks = chunker.chunk(document.text, doc_id="doc_1", pages=document.page_map)

Tip: The semantic and hybrid chunkers require an embedder during chunking (not just during retrieval). Make sure the embedder is configured before calling ingest() with these strategies.

Custom Chunker

Implement BaseChunker and register it:

from cognity_ai.chunkers.base import BaseChunker
from cognity_ai.models import SemanticChunk, PageInfo
from cognity_ai import RAGLibrary

class ParagraphChunker(BaseChunker):
    """Split on double-newlines — simple paragraph-based chunking."""

    def chunk(
        self,
        text: str,
        doc_id: str,
        pages: list[PageInfo] | None = None,
    ) -> list[SemanticChunk]:
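        chunks: list[SemanticChunk] = []
        cursor = 0  # search from here so repeated paragraphs get distinct offsets
        for raw in text.split("\n\n"):
            para = raw.strip()
            if not para:
                continue
            start = text.index(para, cursor)
            cursor = start + len(para)
            chunks.append(
                SemanticChunk(
                    id=f"{doc_id}_chunk_{len(chunks)}",
                    doc_id=doc_id,
                    text=para,
                    embedding=None,
                    metadata={"char_start": start, "char_end": cursor},
                    entities=[],
                    is_parent=False,
                    parent_id=None,
                )
            )
        return chunks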

# Register and use
rag = RAGLibrary()
rag.register_chunker("paragraph", ParagraphChunker)

# Or select the registered key through LibraryConfig
from cognity_ai.config import LibraryConfig
config = LibraryConfig(chunker="paragraph")
rag = RAGLibrary(config=config)
rag.ingest("document.txt")
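
When a custom chunker records char_start and char_end, repeated paragraphs need care: searching from the beginning of the text returns the first occurrence every time. The sketch below shows one way to keep offsets correct by searching from a moving cursor. It is standalone and uses a hypothetical helper, paragraph_spans, rather than any cognity_ai API.

```python
def paragraph_spans(text: str) -> list[tuple[int, int, str]]:
    spans = []
    cursor = 0
    for raw in text.split("\n\n"):
        para = raw.strip()
        if not para:
            continue
        # Search from the cursor so a repeated paragraph maps to its own position.
        start = text.find(para, cursor)
        end = start + len(para)
        spans.append((start, end, para))
        cursor = end
    return spans


text = "same\n\nother\n\nsame"
for start, end, para in paragraph_spans(text):
    # Each span slices back to exactly the paragraph text.
    assert text[start:end] == para
    print(start, end, para)
```

A quick check such as `text[start:end] == para` over every chunk is a cheap way to validate offsets before ingesting a corpus.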