feat: added PythonCodeSplitter by MechaCritter · Pull Request #11380 · deepset-ai/haystack

MechaCritter · 2026-05-23T08:31:18Z

Related Issues

fixes [FEATURE] Support for code syntax-aware Document Splitters #11354

Proposed Changes:

Added the PythonCodeSplitter class as well as its unit tests. For the motivation, please see the issue above.

This splitter produces documents in which functions of reasonable size fit as a whole, whereas oversized functions are split using the fallback secondary split (to avoid bloating the context upon retrieval because of massive methods). The user can control when a function counts as "oversized" via the parameters max_effective_lines, expected_chars_per_line and oversized_factor.
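The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `extract_units`, `fallback_split` and `merge_units` are hypothetical, and the way the three parameters are combined (a character budget of `max_effective_lines * expected_chars_per_line`, with a unit treated as oversized once it exceeds that budget times `oversized_factor`) is an assumption.

```python
import ast


def extract_units(source: str) -> list:
    """Return the source text of each top-level statement, in order (hypothetical helper)."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    units = []
    for node in tree.body:
        # Decorators sit above node.lineno, so start at the first decorator if there is one.
        start = min([node.lineno] + [d.lineno for d in getattr(node, "decorator_list", [])])
        units.append("".join(lines[start - 1 : node.end_lineno]))
    return units


def fallback_split(unit: str, budget: int) -> list:
    """Line-based stand-in for the secondary split: pack whole lines up to `budget` chars."""
    pieces, current = [], ""
    for line in unit.splitlines(keepends=True):
        if current and len(current) + len(line) > budget:
            pieces.append(current)
            current = ""
        current += line
    if current:
        pieces.append(current)
    return pieces


def merge_units(units, max_effective_lines, expected_chars_per_line, oversized_factor):
    """Greedily pack whole units into chunks; only oversized units are broken up."""
    budget = max_effective_lines * expected_chars_per_line  # assumed sizing rule
    chunks, current = [], ""
    for unit in units:
        if len(unit) > budget * oversized_factor:
            # Oversized unit: flush what we have, then hand the unit to the fallback split.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(fallback_split(unit, budget))
        elif current and len(current) + len(unit) > budget:
            # The unit fits on its own but not in the current chunk: start a new chunk.
            chunks.append(current)
            current = unit
        else:
            current += unit
    if current:
        chunks.append(current)
    return chunks
```

In this sketch, two small functions end up together in one chunk, while a very long function is cut into several line-aligned pieces. Blank lines and comments between top-level statements are dropped here for brevity.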

The user can optionally strip the docstring and store it in the meta instead of keeping it in the stored content, so that each document can contain more methods.
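As an illustration of the idea, docstring stripping can be done with the standard `ast` module. The function below is a hypothetical sketch, not the `_strip_docstring` from this PR, and the `"docstring"` meta key is an assumed name. It expects the source of a single function or class definition.

```python
import ast
from typing import Optional, Tuple


def strip_docstring(definition_source: str) -> Tuple[str, Optional[str]]:
    """Return (source without the docstring, docstring text or None)."""
    node = ast.parse(definition_source).body[0]
    docstring = ast.get_docstring(node)
    if docstring is None:
        return definition_source, None
    # When a docstring exists, it is always the first statement of the body.
    doc_node = node.body[0]
    lines = definition_source.splitlines(keepends=True)
    # Drop the lines spanned by the docstring expression and keep everything else.
    # Note: a definition whose body is only a docstring would be left without a body.
    stripped = "".join(lines[: doc_node.lineno - 1] + lines[doc_node.end_lineno :])
    return stripped, docstring


source = 'def add(a, b):\n    """Return the sum."""\n    return a + b\n'
content, doc = strip_docstring(source)
meta = {"docstring": doc}  # hypothetical meta key
```

The stored content then holds only the signature and body, while the docstring remains available through the metadata.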

How did you test it?

I wrote the unit tests using Claude Code. NOTE: I only gave Claude Code the docstring of the class plus the whole init method; Claude did not read the source code file. I then executed the tests myself, and everything passed.

Notes for the reviewer

The parameter expected_chars_per_line defaults to 45 because, according to my research, a line of Python code has roughly 45 characters on average.
I decided to measure code length in effective lines rather than physical lines because some codebases contain a single massive dictionary. In that case, a measure based on expected_chars_per_line is more robust.
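To make the difference between physical and effective lines concrete, here is a small sketch. The formula (character count divided by `expected_chars_per_line`, rounded up) is an assumption for illustration; the PR text does not spell out the exact computation, and `effective_lines` is a hypothetical helper name.

```python
import math


def effective_lines(code: str, expected_chars_per_line: int = 45) -> int:
    """Estimate the length of `code` in lines of typical width (assumed formula)."""
    return math.ceil(len(code) / expected_chars_per_line)


# A large dictionary written on a single physical line.
one_liner = "CONFIG = {" + ", ".join(f"'key{i}': {i}" for i in range(200)) + "}"

physical = len(one_liner.splitlines())  # counts newline-separated lines only
effective = effective_lines(one_liner)  # scales with the number of characters
```

Counting physical lines would treat this dictionary as a one-line unit, whereas the effective-line estimate grows with its actual size.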

Checklist

I have read the contributors guidelines and the code of conduct.
I have updated the related issue with new insights and changes.
I have added unit tests and updated the docstrings.
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
I have documented my code.
I have added a release note file, following the contributors guidelines.
I have run pre-commit hooks and fixed any issue.

vercel · 2026-05-23T08:31:23Z

@MechaCritter is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

…itions into blocks with methods where the class definition fell into another block. - removed "class_name" meta, added "include_classes" that shows all classes present within that block (e.g. when at least 1 method belongs to that class)

…true meaning, and fixed some docstrings

…ar in first document block

github-actions · 2026-05-25T20:59:34Z

Coverage report


File	Lines missing
haystack/components/preprocessors/__init__.py
haystack/components/preprocessors/python_code_splitter.py	150, 162, 190, 197, 231, 280, 308-310, 411-413, 415, 528, 594
haystack/core/pipeline/async_pipeline.py
Project Total

This report was generated by python-coverage-comment-action.

Fixes the "an syntax-aware" → "a syntax-aware" grammar bug and rewrites the release note to describe the splitter's user-visible behavior (syntax-aware merging, oversized fallback, docstring stripping, class signature preservation, and per-chunk metadata), in line with the ``MarkdownHeaderSplitter`` release note. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The class_header path in ``_emit_class_units`` reimplemented the same slice-out-the-docstring logic as ``_strip_docstring``, just with the additional guard that the docstring must lie within the header range (otherwise it belongs to a later chunk). Fold that range check into ``_strip_docstring`` so both call sites share one implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous ``expected_slice.strip() in content or content.strip() in expected_slice`` check passed trivially for short chunks - either side of the ``or`` could be true while a line was silently dropped from the slice. Replaced it with an exact-substring check using ``splitlines(keepends=True)`` so that a regression dropping or reordering lines inside ``[start_line, end_line]`` would fail, and added a companion test with ``preserve_class_definition=False`` that asserts the chunk content ends exactly with the source slice. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
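The weakness of the earlier assertion can be reproduced in a few lines. The helper names `loose_check` and `exact_check` are hypothetical and only mirror the two checks described in the commit message; treating `start_line` and `end_line` as 1-indexed and inclusive is an assumption.

```python
def loose_check(expected_slice: str, content: str) -> bool:
    """The earlier check: passes if either string contains the other (after stripping)."""
    return expected_slice.strip() in content or content.strip() in expected_slice


def exact_check(source: str, start_line: int, end_line: int, content: str) -> bool:
    """The stricter check: the exact source slice must appear in the chunk content."""
    lines = source.splitlines(keepends=True)
    expected_slice = "".join(lines[start_line - 1 : end_line])
    return expected_slice in content


source = "a = 1\nb = 2\nc = 3\n"
expected = "".join(source.splitlines(keepends=True)[0:3])

# A chunk that claims to cover lines 1-3 but silently lost two of them.
broken_chunk = "a = 1\n"

loose_result = loose_check(expected, broken_chunk)
exact_result = exact_check(source, 1, 3, broken_chunk)
```

Here the loose check accepts the broken chunk because the short content is a substring of the expected slice, while the exact check rejects it.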

Brings the new contiguous-slice test in line with the project's ruff line-length, fixing the ``format`` CI check that failed on the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel · 2026-05-29T12:52:30Z

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
haystack-docs	Ignored	Preview	May 29, 2026 1:58pm

Collapsed nine near-identical ``test_invalid_*`` cases that each constructed one bad parameter and asserted ``ValueError`` into a single parametrized test. Same coverage, ~20 lines removed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The second usage example reproduced the entire ``strip_docstrings=True`` setup just to show ``meta_fields_to_embed``; collapse it to a single sentence pointing at the embedder option. The first example already demonstrates how stripping moves docstrings into ``meta``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…er tests

Merged five small test classes into related host classes so the file holds fewer top-level groupings while keeping every test:

- ``TestClassMetadata`` (2 tests) → ``TestIncludeClassesMeta`` (same topic)
- ``TestThreeDecorators`` (2 tests) → ``TestDecorators`` (same topic)
- ``TestUnitKinds`` (1 test) and ``TestMultipleDocuments`` (2 tests) → ``TestBasicOutput`` (both exercise chunk-level meta)
- ``TestEffectiveLines`` (1 test) → ``TestOversizedFallback`` (both exercise merge-sizing behavior)

No tests added, removed, or weakened; 69 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Compressed the class docstring (dropped the duplicated General Behavior section, the verbose example-output blocks, and one of two usage examples), shortened the ``__init__`` parameter descriptions, and trimmed multi-line inline comments down to a single line where the extra prose was restating the obvious. Public information preserved: all parameters still documented, behavior contract unchanged, and one runnable usage example kept. Source file 789 -> 612 lines (-177). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ter tests

- Dropped multi-line ``assert ..., f"long message"`` failure strings; pytest's assertion introspection already prints the operands.
- Dropped one-line comments that restated the next assertion.
- Parametrized the three ``test_*_unit_kind_present`` cases (one body, three kinds) and the two ``include_classes`` membership cases (one body, two class names); dropped the subsumed ``test_include_classes_set_for_class_chunks``.
- Inlined a few list-comprehensions in place of ``for``-extend loops.

68 tests pass (down from 69 only because the dropped union test is trivially implied by the two parametrized membership tests). Test file 982 -> 874 lines (-108).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

julian-risch

Thank you for opening this PR, @MechaCritter! I made some smaller improvements, including extracting a small helper for _emit_class_units and _strip_docstring.
It's a big PR, so I reduced the number of lines of code with a couple of commits to make the code more readable and maintainable.
Thanks again, and please let us know how your project with this new component is going! I'll open a follow-up issue about adding a documentation page for this new component.

MechaCritter · 2026-05-29T18:01:28Z

Thanks for your review @julian-risch . It was fun contributing :)

initialized class PythonCodeSplitter

a69d8f9

MechaCritter added 2 commits May 25, 2026 15:06

added Python Code Splitter main code

bdc277a

added python_code_splitter into __init__

3e51bfc

github-actions Bot added the type:documentation Improvements on the docs label May 25, 2026

MechaCritter added 9 commits May 25, 2026 19:30

updated docstring of _CodeUnit

ebe4fad

added test

10ea359

added release note

6590ec8

changed some variables in _extract_units_extract_units to convey its …

eb78982

…true meaning, and fixed some docstrings

added test_invalid_secondary_split_length

605c0c9

added test_strip_class_header_docstring_moves_to_meta

d81447e

simplified release note

4b02784

added test for top level statements: top level statements should appe…

f396e39

…ar in first document block

github-actions Bot added the topic:tests label May 25, 2026

MechaCritter added 4 commits May 25, 2026 20:34

formatted code

f0e5a4c

fixed bug in test, removed the unused key

2a01c55

fixed small bugs in test

b7d05b6

refactored main code a little

f21879d

small changes to docstring and type annotation in source code

03bb66a

MechaCritter changed the title ~~initialized class PythonCodeSplitter~~ feat: added PythonCodeSplitter May 26, 2026

MechaCritter marked this pull request as ready for review May 26, 2026 18:45

MechaCritter requested a review from a team as a code owner May 26, 2026 18:45

MechaCritter requested review from julian-risch and removed request for a team May 26, 2026 18:45

MechaCritter and others added 4 commits May 26, 2026 20:48

reverted kind to be str typed to avoid problem with type checkers

dc4f026

style: apply hatch fmt to test_python_code_splitter

acdb09b


julian-risch and others added 5 commits May 29, 2026 15:39

julian-risch approved these changes May 29, 2026


julian-risch enabled auto-merge (squash) May 29, 2026 14:03

julian-risch mentioned this pull request May 29, 2026

Docs: Add a docs page for PythonCodeSplitter #11434

Open

julian-risch merged commit fd10d51 into deepset-ai:main May 29, 2026
24 checks passed

MechaCritter deleted the feature/python_document_splitter branch May 29, 2026 17:56

julian-risch merged 27 commits into deepset-ai:main from MechaCritter:feature/python_document_splitter
