Chunkers
Text chunking strategies control how documents are split into retrievable pieces.
BaseChunker ABC
All chunkers implement BaseChunker. You can use the string key when constructing RAGLibrary, or import and instantiate a chunker directly.
from abc import ABC, abstractmethod
from cognity_ai.models import SemanticChunk, PageInfo
class BaseChunker(ABC):
    @abstractmethod
    def chunk(
        self,
        text: str,
        doc_id: str,
        pages: list[PageInfo] | None = None,
    ) -> list[SemanticChunk]:
        """
        Split text into chunks.

        Args:
            text: Full document text.
            doc_id: Source document identifier.
            pages: Page boundary metadata from HybridPageIndex (optional).

        Returns:
            list[SemanticChunk]: ordered chunks with metadata.
        """
        ...
SemanticChunk Model
Every chunker returns a list of SemanticChunk dataclass instances:
from cognity_ai.models import SemanticChunk
@dataclass
class SemanticChunk:
    id: str                        # "{doc_id}_chunk_{n}"
    doc_id: str                    # parent document ID
    text: str                      # chunk text
    embedding: list[float] | None  # populated after embed_batch()
    metadata: dict                 # page_number, section, char_start, char_end
    entities: list[str]            # entity names extracted from this chunk
    is_parent: bool                # True for parent chunks (parent_child strategy)
    parent_id: str | None          # set on child chunks pointing to their parent
All Strategies
The sentence strategy (SentenceChunker) splits on sentence boundaries detected by spaCy, then groups sentences into chunks within a token budget. It preserves complete sentences, with no mid-sentence cuts. Best balance of quality and speed for most text.
from cognity_ai.chunkers import SentenceChunker
chunker = SentenceChunker(max_tokens=512, overlap_sentences=1)
chunks = chunker.chunk(text, doc_id="my_doc", pages=pages)
print(chunks[0].text) # complete sentence(s)
print(chunks[0].metadata) # {"page_number": 1, "char_start": 0, ...}
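The grouping step described above can be sketched without the library. This is a minimal illustration, not SentenceChunker's actual implementation: split_sentences and group_sentences are hypothetical names, a naive regex stands in for spaCy's sentence detection, and tokens are approximated by whitespace-separated words.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive regex splitter; an assumption standing in for spaCy segmentation.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def group_sentences(
    sentences: list[str], max_tokens: int, overlap_sentences: int = 1
) -> list[str]:
    """Greedily pack whole sentences into chunks within a token budget."""
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for sentence in sentences:
        n = len(sentence.split())  # crude token count (assumption)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry the trailing sentences into the next chunk as overlap.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            current_tokens = sum(len(s.split()) for s in current)
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in group_sentences(split_sentences(text), max_tokens=6):
    print(chunk)
```

In this sketch the overlap sentences count toward the next chunk's budget, so a chunk can exceed max_tokens when a carried-over sentence and the next sentence are both long.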
The fixed strategy (FixedChunker) splits text into fixed-size token windows with a configurable overlap. Fast and predictable. Chunks may cut mid-sentence. Good for structured or tabular text where sentence boundaries are less meaningful.
from cognity_ai.chunkers import FixedChunker
chunker = FixedChunker(chunk_size=512, overlap=64)
chunks = chunker.chunk(text, doc_id="my_doc")
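To make the relationship between chunk_size and overlap concrete, here is a small sketch of overlapping fixed-size windows. It is an illustration under assumptions, not FixedChunker's code: fixed_windows is a hypothetical helper, and it works on an already tokenized list rather than on the library's tokenizer.

```python
def fixed_windows(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    windows: list[list[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows


for window in fixed_windows(list("abcdefghij"), chunk_size=4, overlap=1):
    print("".join(window))
```

Each window starts chunk_size - overlap tokens after the previous one, which is why overlap must be strictly smaller than chunk_size.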
The semantic strategy (SemanticChunker) groups consecutive sentences by embedding cosine similarity. A new chunk starts when similarity drops below the threshold. It produces topically coherent chunks but requires an embedder during ingestion, so it is slower than sentence or fixed.
from cognity_ai.chunkers import SemanticChunker
from cognity_ai.embedders import GeminiEmbedder
chunker = SemanticChunker(
    embedder=GeminiEmbedder(api_key="..."),
    similarity_threshold=0.8,
    max_tokens=1024,
)
chunks = chunker.chunk(text, doc_id="my_doc")
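The threshold rule above can be illustrated with a toy embedder. This sketch is an assumption-laden stand-in, not SemanticChunker itself: embed is a hypothetical bag-of-words vectorizer used in place of a real embedding model, group_by_similarity is a hypothetical name, and the max_tokens cap is left out for brevity.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    # Toy bag-of-words vector; a real embedder would return dense floats.
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def group_by_similarity(sentences: list[str], threshold: float) -> list[list[str]]:
    """Start a new group whenever similarity to the previous sentence drops."""
    groups: list[list[str]] = []
    previous = None
    for sentence in sentences:
        vector = embed(sentence)
        if previous is not None and cosine(previous, vector) >= threshold:
            groups[-1].append(sentence)
        else:
            groups.append([sentence])
        previous = vector
    return groups


sentences = ["Cats chase mice.", "Cats chase birds.", "Stocks fell sharply."]
print(group_by_similarity(sentences, threshold=0.5))
```

Every sentence is embedded once, which is the per-sentence cost that makes this strategy slower than the sentence and fixed strategies.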
The recursive strategy (RecursiveChunker) recursively splits on a priority list of separators ("\n\n", "\n", ". ", " ") until chunks fit within the token budget. Inspired by LangChain's recursive splitter. Good for markdown and code.
from cognity_ai.chunkers import RecursiveChunker
chunker = RecursiveChunker(
    chunk_size=512,
    overlap=64,
    separators=["\n\n", "\n", ". ", " "],
)
chunks = chunker.chunk(text, doc_id="my_doc")
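The separator-priority idea can be sketched as follows. This is a simplified illustration, not RecursiveChunker's implementation: recursive_split and n_tokens are hypothetical names, tokens are approximated by whitespace-separated words, and overlap is omitted.

```python
def n_tokens(text: str) -> int:
    return len(text.split())  # crude token count (assumption)


def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the first separator; recurse with finer ones on oversized parts."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if n_tokens(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard split on words.
        words = text.split()
        return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    head, *rest = separators
    chunks: list[str] = []
    buffer = ""
    for part in text.split(head):
        if not part.strip():
            continue
        candidate = part if not buffer else buffer + head + part
        if n_tokens(candidate) <= chunk_size:
            buffer = candidate  # keep merging small neighbours at this level
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if n_tokens(part) <= chunk_size:
            buffer = part
        else:
            chunks.extend(recursive_split(part, chunk_size, rest))
    if buffer:
        chunks.append(buffer)
    return chunks


text = "a b c\n\nd e f g h\n\ni"
print(recursive_split(text, chunk_size=3, separators=["\n\n", "\n", ". ", " "]))
```

Coarse separators are tried first, so paragraph boundaries are kept whenever a paragraph fits the budget, and only oversized parts fall through to finer separators.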
The parent_child strategy produces two levels of chunks: small child chunks for retrieval (precise similarity matching) and large parent chunks returned as context to the LLM. Use with ParentChildRetriever. Best for long documents where you need precise retrieval but rich context.
from cognity_ai import RAGLibrary
# Must pair chunker + retriever
rag = RAGLibrary(chunker="parent_child", rag_method="parent_child")
rag.ingest("long_report.pdf")
answer = rag.query("What were the Q3 results?")
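The two-level structure can be pictured with a small standalone sketch. Everything here is hypothetical and only mirrors the is_parent and parent_id fields of SemanticChunk: the Chunk class, the ID format, build_parent_child and resolve_parents are illustrative names, and the library's actual chunk sizes and retriever logic may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    id: str
    text: str
    is_parent: bool
    parent_id: Optional[str] = None


def build_parent_child(
    words: list[str], doc_id: str, parent_size: int, child_size: int
) -> list[Chunk]:
    """Cut large parent chunks, then cut each parent into small children."""
    chunks: list[Chunk] = []
    for p, start in enumerate(range(0, len(words), parent_size)):
        parent_words = words[start:start + parent_size]
        parent = Chunk(f"{doc_id}_parent_{p}", " ".join(parent_words), True)
        chunks.append(parent)
        for c, offset in enumerate(range(0, len(parent_words), child_size)):
            child_text = " ".join(parent_words[offset:offset + child_size])
            chunks.append(Chunk(f"{parent.id}_child_{c}", child_text, False, parent.id))
    return chunks


def resolve_parents(child_hits: list[Chunk], chunks: list[Chunk]) -> list[Chunk]:
    """Map retrieved children to their parents, dropping duplicates in order."""
    by_id = {chunk.id: chunk for chunk in chunks}
    seen: set[str] = set()
    parents: list[Chunk] = []
    for hit in child_hits:
        if hit.parent_id is not None and hit.parent_id not in seen:
            seen.add(hit.parent_id)
            parents.append(by_id[hit.parent_id])
    return parents


chunks = build_parent_child("a b c d e f g h".split(), "doc", parent_size=4, child_size=2)
children = [chunk for chunk in chunks if not chunk.is_parent]
print([p.text for p in resolve_parents([children[1], children[0]], chunks)])
```

Similarity search would run over the children only; the parents are what gets passed to the LLM, which is why the chunker and the retriever have to be paired.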
The hybrid strategy (HybridChunker) uses a two-pass approach: sentence-boundary primary chunking followed by a semantic similarity regrouping pass. It merges adjacent chunks that are topically similar and splits those that diverge. Best quality; moderate speed overhead.
from cognity_ai import RAGLibrary
rag = RAGLibrary(chunker="hybrid", embedder="openai", openai_api_key="sk-...")
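The merge half of the regrouping pass can be sketched as below. This is only an illustration under assumptions, not HybridChunker's algorithm: merge_similar is a hypothetical name, Jaccard word overlap stands in for embedding similarity, tokens are whitespace-separated words, and the splitting of divergent chunks is not shown.

```python
def jaccard(a: str, b: str) -> float:
    # Word-set overlap; a stand-in for embedding cosine similarity.
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def merge_similar(chunks: list[str], threshold: float, max_tokens: int) -> list[str]:
    """Merge a chunk into its predecessor when similar and within budget."""
    merged: list[str] = []
    for chunk in chunks:
        fits = bool(merged) and (
            len(merged[-1].split()) + len(chunk.split()) <= max_tokens
        )
        if fits and jaccard(merged[-1], chunk) >= threshold:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged


first_pass = [
    "red apples taste sweet",
    "green apples taste sour",
    "the train left late",
]
print(merge_similar(first_pass, threshold=0.3, max_tokens=20))
```

The first pass supplies sentence-aligned chunks, so the second pass only has to decide where adjacent chunks belong together.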
Comparison Table
| Key | Retrieval Quality | Context Coherence | Speed | Best For |
|---|---|---|---|---|
| sentence | Excellent | Excellent | Fast | General purpose, default choice |
| fixed | Good | Fair | Fastest | Structured/tabular text, CSV, JSON |
| semantic | Excellent | Excellent | Slow (embedding per sentence) | High-quality ingestion, topic-dense corpora |
| recursive | Good | Good | Fast | Markdown, code files, hierarchical text |
| parent_child | Excellent | Excellent | Moderate | Long documents needing precise retrieval + broad context |
| hybrid | Best | Best | Moderate–slow | Quality-critical pipelines with budget for two passes |
Configuration
Set the chunker via the chunker string key or through LibraryConfig:
from cognity_ai import RAGLibrary
from cognity_ai.config import LibraryConfig
# Via string key (simplest)
rag = RAGLibrary(chunker="sentence")
rag = RAGLibrary(chunker="parent_child", rag_method="parent_child")
# Via LibraryConfig (full control)
config = LibraryConfig(chunker="hybrid")
rag = RAGLibrary(config=config)
# Direct instantiation (advanced)
from cognity_ai.chunkers import HybridChunker
from cognity_ai.embedders import OpenAIEmbedder
chunker = HybridChunker(
    embedder=OpenAIEmbedder(api_key="sk-..."),
    max_tokens=768,
    similarity_threshold=0.78,
)
chunks = chunker.chunk(document.text, doc_id="doc_1", pages=document.page_map)
Note: the semantic and hybrid chunkers require an embedder during chunking, not just during retrieval. Make sure the embedder is configured before calling ingest() with these strategies.
Custom Chunker
Implement BaseChunker and register it:
from cognity_ai.chunkers.base import BaseChunker
from cognity_ai.models import SemanticChunk, PageInfo
from cognity_ai import RAGLibrary
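class ParagraphChunker(BaseChunker):
    """Split on double newlines: simple paragraph-based chunking."""

    def chunk(
        self,
        text: str,
        doc_id: str,
        pages: list[PageInfo] | None = None,
    ) -> list[SemanticChunk]:
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks: list[SemanticChunk] = []
        cursor = 0  # search from the end of the previous paragraph
        for i, para in enumerate(paragraphs):
            # Searching from cursor gives repeated paragraphs distinct offsets.
            start = text.index(para, cursor)
            cursor = start + len(para)
            chunks.append(
                SemanticChunk(
                    id=f"{doc_id}_chunk_{i}",
                    doc_id=doc_id,
                    text=para,
                    embedding=None,
                    metadata={"char_start": start, "char_end": cursor},
                    entities=[],
                    is_parent=False,
                    parent_id=None,
                )
            )
        return chunks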
# Register and use
rag = RAGLibrary()
rag.register_chunker("paragraph", ParagraphChunker)
# Or reference the registered key through LibraryConfig
from cognity_ai.config import LibraryConfig
config = LibraryConfig(chunker="paragraph")
rag = RAGLibrary(config=config)
rag.ingest("document.txt")