Expose split_by_character parameter in HTTP API /documents/text endpoint #2942

@gn1024

Problem

The Python API method LightRAG.ainsert() supports the split_by_character and split_by_character_only parameters for custom chunk splitting. However, the HTTP API endpoints /documents/text and /documents/texts in lightrag/api/document_routes.py do not expose them: they are missing from the Pydantic request models and are not passed through to ainsert().

This forces HTTP API users to rely solely on the built-in token-based chunker, even when they have pre-chunked content with a known separator.

Use Case

We pre-chunk documents with a semantic chunker (heading-aware, with breadcrumbs and atomic blocks) before sending to LightRAG. We join chunks with a unique separator and want LightRAG to split on it, preserving our chunk boundaries as-is.

Without split_by_character in the HTTP API, the only options are:

  1. Send each chunk as a separate document (/documents/texts with N items). This creates N doc_ids per file and breaks deletion, deduplication, and doc_status tracking.
  2. Use the Python API directly. This is not possible when LightRAG runs as a separate service.
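
To make the use case concrete, here is a minimal client-side sketch of how pre-chunked content could be joined and sent as one document if the proposed fields existed. The separator string and the build_insert_payload helper are hypothetical names chosen for illustration; the two split_by_character* fields are the ones proposed in this issue and are not accepted by the HTTP API today.

```python
import json

# Hypothetical separator; any string that cannot occur in the chunk text works.
SEPARATOR = "\n<<<CHUNK_BOUNDARY>>>\n"


def build_insert_payload(chunks, separator=SEPARATOR):
    """Join pre-chunked text into one request body for POST /documents/text."""
    for chunk in chunks:
        if separator in chunk:
            raise ValueError("separator occurs inside a chunk; pick a different one")
    return {
        "text": separator.join(chunks),
        # The two fields below are the ones proposed in this issue.
        "split_by_character": separator,
        "split_by_character_only": True,
    }


chunks = [
    "Guide > Install\nRun the installer.",
    "Guide > Configure\nEdit the config file.",
]
payload = build_insert_payload(chunks)
body = json.dumps(payload)  # JSON body to send to the service

# Splitting on the separator recovers the original chunk boundaries.
assert payload["text"].split(SEPARATOR) == chunks
```

With this shape, one file maps to one document (and one doc_id) while the chunk boundaries chosen by the external chunker are preserved.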

Proposed Change

Add split_by_character and split_by_character_only fields to InsertTextRequest and InsertTextsRequest in document_routes.py, and pass them through to rag.ainsert().

InsertTextRequest

from typing import Optional

from pydantic import BaseModel, Field


class InsertTextRequest(BaseModel):
    text: str
    # ... existing fields ...
    split_by_character: Optional[str] = Field(
        default=None,
        description="Character(s) to split the text on instead of token-based chunking",
    )
    split_by_character_only: bool = Field(
        default=False,
        description="If True, split only on split_by_character without token-based fallback",
    )
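
As a rough illustration of what the two fields control, the sketch below mimics the three modes with a toy splitter. It is not LightRAG's chunker: split_sketch and max_tokens are hypothetical names, "tokens" are approximated by whitespace-separated words, and chunk overlap is ignored. The real implementation uses a tokenizer and may treat oversized chunks differently.

```python
def split_sketch(text, split_by_character=None, split_by_character_only=False, max_tokens=4):
    """Toy model of the chunking modes; 'tokens' are whitespace-separated words."""

    def by_tokens(piece):
        # Cut a piece into runs of at most max_tokens words (no overlap).
        words = piece.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

    if split_by_character is None:
        # Default behavior: token-based chunking only.
        return by_tokens(text)

    pieces = [p.strip() for p in text.split(split_by_character) if p.strip()]
    if split_by_character_only:
        # Keep the caller's boundaries exactly as given.
        return pieces

    # Split on the separator first, then cut any oversized piece by tokens.
    out = []
    for piece in pieces:
        out.extend(by_tokens(piece))
    return out


text = "one two three four five six||seven eight"
separator_only = split_sketch(text, "||", split_by_character_only=True)
separator_with_fallback = split_sketch(text, "||")
token_only = split_sketch(text)
```

In this toy model, separator_only keeps the six-word first chunk intact, while separator_with_fallback cuts it at the four-word limit.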

Route handler

await rag.ainsert(
    request.text,
    split_by_character=request.split_by_character,
    split_by_character_only=request.split_by_character_only,
)
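
The pass-through can be checked in isolation with a stub. The sketch below is self-contained and makes several assumptions: InsertTextRequestSketch is a dataclass stand-in for the Pydantic model, FakeRAG is a hypothetical stub that only records its arguments, and insert_text is a simplified handler. The real route handler in document_routes.py likely does more (background tasks, other arguments), which is omitted here.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class InsertTextRequestSketch:
    """Stand-in for the Pydantic request model; only the relevant fields."""
    text: str
    split_by_character: Optional[str] = None
    split_by_character_only: bool = False


class FakeRAG:
    """Hypothetical stub that records what ainsert() receives."""

    def __init__(self):
        self.calls = []

    async def ainsert(self, text, split_by_character=None, split_by_character_only=False):
        self.calls.append((text, split_by_character, split_by_character_only))


async def insert_text(rag, request):
    # Forward both chunking options unchanged.
    await rag.ainsert(
        request.text,
        split_by_character=request.split_by_character,
        split_by_character_only=request.split_by_character_only,
    )


rag = FakeRAG()
# A request that uses the new fields.
asyncio.run(insert_text(rag, InsertTextRequestSketch("a|b", "|", True)))
# A request that omits them falls back to the defaults (None, False).
asyncio.run(insert_text(rag, InsertTextRequestSketch("plain text")))
```

The second call shows the backward-compatible path: a client that never sends the new fields results in the same ainsert() arguments as before.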

The same change applies to InsertTextsRequest and the /documents/texts endpoint.

Notes

  • Fully backward compatible: both fields are optional, and their defaults match current behavior.
  • We have a working patch in production (LightRAG v1.4.14) and can submit a PR.
