feat: added PythonCodeSplitter #11380

Merged
julian-risch merged 27 commits into deepset-ai:main from MechaCritter:feature/python_document_splitter
May 29, 2026

@MechaCritter
Contributor

@MechaCritter MechaCritter commented May 23, 2026

Related Issues

Proposed Changes:

Added the PythonCodeSplitter class along with its unit tests. For the motivation, please see the related issue.

This splitter produces documents in which reasonably sized functions are kept whole, whereas oversized functions are split using the fallback secondary split, so that massive methods do not bloat the context at retrieval time. The user controls what counts as "oversized" via the parameters max_effective_lines, expected_chars_per_line and oversized_factor.

The user can optionally strip docstrings and store them in the meta instead of keeping them in the stored content, so that each document can hold more methods.
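
To illustrate the docstring option, here is a minimal sketch of moving a docstring out of the content and into metadata with the standard `ast` module. It is an assumption about the general approach, not the splitter's actual implementation: the helper name `strip_docstring` and the `"docstring"` meta key are hypothetical, and the sketch only looks at the first top-level definition.

```python
import ast
from typing import Dict, Tuple


def strip_docstring(source: str) -> Tuple[str, Dict[str, str]]:
    """Return (source without the first definition's docstring, meta dict).

    Hypothetical helper for illustration only. If the docstring is the
    only statement in the body, the returned content is no longer valid
    Python on its own.
    """
    tree = ast.parse(source)
    if not tree.body:
        return source, {}
    node = tree.body[0]
    if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
        return source, {}
    docstring = ast.get_docstring(node)
    if docstring is None:
        return source, {}
    # When a docstring exists, it is the first statement of the body.
    doc_node = node.body[0]
    lines = source.splitlines(keepends=True)
    # lineno and end_lineno are 1-based and inclusive.
    kept = lines[: doc_node.lineno - 1] + lines[doc_node.end_lineno :]
    return "".join(kept), {"docstring": docstring}


src = 'def add(a, b):\n    """Add two numbers."""\n    return a + b\n'
content, meta = strip_docstring(src)
```

Here `content` holds only the signature and the `return` line, while the docstring text is available under `meta["docstring"]`.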

How did you test it?

  • I wrote the unit tests using Claude Code. NOTE: I gave Claude Code only the docstring of the class plus the whole __init__ method; it did not read the source code file. I then ran the tests myself, and all of them passed.

Notes for the reviewer

  • The parameter expected_chars_per_line defaults to 45 because, according to my research, a line of Python code averages roughly 45 characters.
  • I decided to measure code length in effective lines rather than raw lines because some codebases contain, for example, a single massive dictionary. In that case, a measure based on expected_chars_per_line is more robust.
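
As a rough illustration of the sizing parameters, the sketch below shows one way an "effective line" count and an oversized check could work. This is an assumption for illustration, not the code merged in this PR: the helper names `effective_lines` and `is_oversized`, the formula, and the defaults for `max_effective_lines` and `oversized_factor` are hypothetical. Only the default of 45 for `expected_chars_per_line` comes from the description above.

```python
import math


def effective_lines(source: str, expected_chars_per_line: int = 45) -> int:
    """Estimate code length from character count (hypothetical formula).

    Leading and trailing whitespace on each line is ignored, so the
    result depends on how much code there is, not on how it is wrapped.
    """
    chars = sum(len(line.strip()) for line in source.splitlines())
    return math.ceil(chars / expected_chars_per_line)


def is_oversized(
    source: str,
    max_effective_lines: int = 50,  # hypothetical default
    oversized_factor: float = 1.5,  # hypothetical default
    expected_chars_per_line: int = 45,
) -> bool:
    """Treat a unit as oversized once it exceeds the scaled line budget."""
    limit = max_effective_lines * oversized_factor
    return effective_lines(source, expected_chars_per_line) > limit


# A dictionary literal written on one physical line.
one_line_dict = "DATA = {" + ", ".join(f"'k{i}': {i}" for i in range(1000)) + "}"
small_function = "def f(x):\n    return x + 1\n"
```

Under these assumptions, `one_line_dict` counts as a single raw line but is flagged by `is_oversized`, while `small_function` is not, which is the robustness argument made in the second note.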

Checklist

  • I have read the contributors guidelines and the code of conduct.
  • I have updated the related issue with new insights and changes.
  • I have added unit tests and updated the docstrings.
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I have documented my code.
  • I have added a release note file, following the contributors guidelines.
  • I have run pre-commit hooks and fixed any issue.

@vercel

vercel Bot commented May 23, 2026

@MechaCritter is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions github-actions Bot added the type:documentation Improvements on the docs label May 25, 2026
…itions into blocks with methods where the class definition fell into another block.

- Removed the "class_name" meta field and added "include_classes", which lists all classes present within that block (e.g. when at least one method belongs to that class).
@github-actions
Contributor

github-actions Bot commented May 25, 2026

Coverage report

Columns: File | Statements | Missing | Coverage | Coverage (new stmts) | Lines missing
  haystack/components/preprocessors
    __init__.py
    python_code_splitter.py | Lines missing: 150, 162, 190, 197, 231, 280, 308-310, 411-413, 415, 528, 594
  haystack/core/pipeline
    async_pipeline.py
Project Total

This report was generated by python-coverage-comment-action

@MechaCritter MechaCritter changed the title from "initialized class PythonCodeSplitter" to "feat: added PythonCodeSplitter" May 26, 2026
@MechaCritter MechaCritter marked this pull request as ready for review May 26, 2026 18:45
@MechaCritter MechaCritter requested a review from a team as a code owner May 26, 2026 18:45
@MechaCritter MechaCritter requested review from julian-risch and removed request for a team May 26, 2026 18:45
MechaCritter and others added 4 commits May 26, 2026 20:48
Fixes the "an syntax-aware" → "a syntax-aware" grammar bug and rewrites
the release note to describe the splitter's user-visible behavior
(syntax-aware merging, oversized fallback, docstring stripping, class
signature preservation, and per-chunk metadata), in line with the
``MarkdownHeaderSplitter`` release note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The class_header path in ``_emit_class_units`` reimplemented the same
slice-out-the-docstring logic as ``_strip_docstring``, just with the
additional guard that the docstring must lie within the header range
(otherwise it belongs to a later chunk). Fold that range check into
``_strip_docstring`` so both call sites share one implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous ``expected_slice.strip() in content or content.strip() in
expected_slice`` check passed trivially for short chunks - either side
of the ``or`` could be true while a line was silently dropped from the
slice. Replaced it with an exact-substring check using
``splitlines(keepends=True)`` so that a regression dropping or
reordering lines inside ``[start_line, end_line]`` would fail, and
added a companion test with ``preserve_class_definition=False`` that
asserts the chunk content ends exactly with the source slice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
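
The weakness described in the commit message above can be reproduced in a few lines. The sketch below is an illustration under stated assumptions rather than the actual test code: `loose_check`, `exact_check`, and the sample `source` are hypothetical, and `start_line` and `end_line` are taken to be 1-based and inclusive.

```python
def loose_check(expected_slice: str, content: str) -> bool:
    """The bidirectional containment check quoted in the commit message."""
    return expected_slice.strip() in content or content.strip() in expected_slice


def exact_check(source: str, start_line: int, end_line: int, content: str) -> bool:
    """Require the chunk to contain the exact source slice, line endings included."""
    lines = source.splitlines(keepends=True)
    expected_slice = "".join(lines[start_line - 1 : end_line])
    return expected_slice in content


source = "def f():\n    a = 1\n    b = 2\n    return a + b\n"
expected = "".join(source.splitlines(keepends=True)[0:4])

# A chunk that silently lost its last two lines.
broken = "def f():\n    a = 1\n"

loose_result = loose_check(expected, broken)      # True: the short chunk is a substring of the slice
exact_result = exact_check(source, 1, 4, broken)  # False: the full slice is not in the chunk
```

The second half of the `or` is what lets the truncated chunk through: any short chunk that happens to be a prefix of the expected slice satisfies it. Checking containment in one direction only, against the joined `splitlines(keepends=True)` slice, removes that loophole.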
Brings the new contiguous-slice test in line with the project's ruff
line-length, fixing the ``format`` CI check that failed on the
previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented May 29, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
haystack-docs Ignored Ignored Preview May 29, 2026 1:58pm

julian-risch and others added 5 commits May 29, 2026 15:39
Collapsed nine near-identical ``test_invalid_*`` cases that each
constructed one bad parameter and asserted ``ValueError`` into a single
parametrized test. Same coverage, ~20 lines removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The second usage example reproduced the entire ``strip_docstrings=True``
setup just to show ``meta_fields_to_embed``; collapse it to a single
sentence pointing at the embedder option. The first example already
demonstrates how stripping moves docstrings into ``meta``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er tests

Merged five small test classes into related host classes so the file
holds fewer top-level groupings while keeping every test:

- ``TestClassMetadata`` (2 tests) → ``TestIncludeClassesMeta`` (same topic)
- ``TestThreeDecorators`` (2 tests) → ``TestDecorators`` (same topic)
- ``TestUnitKinds`` (1 test) and ``TestMultipleDocuments`` (2 tests)
  → ``TestBasicOutput`` (both exercise chunk-level meta)
- ``TestEffectiveLines`` (1 test) → ``TestOversizedFallback``
  (both exercise merge-sizing behavior)

No tests added, removed, or weakened; 69 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Compressed the class docstring (dropped the duplicated General Behavior
section, the verbose example-output blocks, and one of two
usage examples), shortened the ``__init__`` parameter descriptions, and
trimmed multi-line inline comments down to a single line where the
extra prose was restating the obvious. Public information preserved:
all parameters still documented, behavior contract unchanged, and one
runnable usage example kept. Source file 789 -> 612 lines (-177).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ter tests

- Dropped multi-line ``assert ..., f"long message"`` failure strings;
  pytest's assertion introspection already prints the operands.
- Dropped one-line comments that restated the next assertion.
- Parametrized the three ``test_*_unit_kind_present`` cases (one body,
  three kinds) and the two ``include_classes`` membership cases (one
  body, two class names); dropped the subsumed
  ``test_include_classes_set_for_class_chunks``.
- Inlined a few list-comprehensions in place of ``for``-extend loops.

68 tests pass (down from 69 only because the dropped union test is
trivially implied by the two parametrized membership tests).
Test file 982 -> 874 lines (-108).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member

@julian-risch julian-risch left a comment

Thank you for opening this PR @MechaCritter! I made some smaller improvements, including extracting a small helper for _emit_class_units and _strip_docstring.
It's a big PR, so I reduced the number of lines of code with a couple of commits to make the code more readable and maintainable.
Thanks again and please let us know how your project with this new component is going! I'll open a follow-up issue about adding a documentation page for this new component.

@julian-risch julian-risch enabled auto-merge (squash) May 29, 2026 14:03
@julian-risch julian-risch merged commit fd10d51 into deepset-ai:main May 29, 2026
24 checks passed
@MechaCritter MechaCritter deleted the feature/python_document_splitter branch May 29, 2026 17:56
@MechaCritter
Contributor Author

Thanks for your review @julian-risch . It was fun contributing :)

Labels

topic:tests type:documentation Improvements on the docs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Support for code syntax-aware Document Splitters

2 participants