feat: added PythonCodeSplitter #11380
…itions into blocks with methods where the class definition fell into another block. - removed "class_name" meta, added "include_classes" that shows all classes present within that block (e.g. when at least 1 method belongs to that class)
…true meaning, and fixed some docstrings
…ar in first document block
Fixes the "an syntax-aware" → "a syntax-aware" grammar bug and rewrites the release note to describe the splitter's user-visible behavior (syntax-aware merging, oversized fallback, docstring stripping, class signature preservation, and per-chunk metadata), in line with the ``MarkdownHeaderSplitter`` release note. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The class_header path in ``_emit_class_units`` reimplemented the same slice-out-the-docstring logic as ``_strip_docstring``, just with the additional guard that the docstring must lie within the header range (otherwise it belongs to a later chunk). Fold that range check into ``_strip_docstring`` so both call sites share one implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
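The range guard described in this commit can be illustrated with a minimal sketch. It is not the code from the PR: the function name `strip_docstring_in_range`, its signature, and the `SOURCE` sample are hypothetical, and only the standard-library `ast` module is used. The idea is that a docstring is removed only when it lies entirely inside the requested line range; otherwise the slice is returned untouched because the docstring belongs to another chunk.

```python
import ast
from typing import List, Optional, Tuple

# Hypothetical sample input, used only for illustration.
SOURCE = '''class Greeter:
    """Says hello."""

    def greet(self):
        return "hi"
'''


def strip_docstring_in_range(
    lines: List[str], node: ast.AST, start: int, end: int
) -> Tuple[str, Optional[str]]:
    """Return (text, docstring) for lines[start..end] (1-indexed, inclusive).

    The docstring is cut out only if it sits fully inside the range.
    This is an illustrative helper, not the PR's ``_strip_docstring``.
    """
    chunk = "".join(lines[start - 1:end])
    body = getattr(node, "body", [])
    if not body:
        return chunk, None
    first = body[0]
    is_docstring = (
        isinstance(first, ast.Expr)
        and isinstance(first.value, ast.Constant)
        and isinstance(first.value.value, str)
    )
    if not is_docstring:
        return chunk, None
    # Range guard: a docstring outside [start, end] belongs to another chunk.
    if first.lineno < start or first.end_lineno > end:
        return chunk, None
    kept = lines[start - 1:first.lineno - 1] + lines[first.end_lineno:end]
    return "".join(kept), ast.get_docstring(node)


if __name__ == "__main__":
    class_node = ast.parse(SOURCE).body[0]
    source_lines = SOURCE.splitlines(keepends=True)
    # Header range contains the docstring, so it is stripped and returned.
    print(strip_docstring_in_range(source_lines, class_node, 1, 2))
    # Method range does not contain it, so the slice is left as-is.
    print(strip_docstring_in_range(source_lines, class_node, 4, 5))
```

Keeping the guard inside the helper means that a caller emitting only a class header and a caller emitting a whole function can share the same code path, which is the consolidation the commit message describes.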
The previous ``expected_slice.strip() in content or content.strip() in expected_slice`` check passed trivially for short chunks - either side of the ``or`` could be true while a line was silently dropped from the slice. Replaced it with an exact-substring check using ``splitlines(keepends=True)`` so that a regression dropping or reordering lines inside ``[start_line, end_line]`` would fail, and added a companion test with ``preserve_class_definition=False`` that asserts the chunk content ends exactly with the source slice. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
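To see why the earlier check was too permissive, here is a small self-contained sketch. The helper names `source_slice`, `loose_check`, and `strict_check` are hypothetical and are not taken from the PR's test file; the loose expression mirrors the one quoted in the commit message, and the strict one is one plausible form of an exact-substring check.

```python
def source_slice(source: str, start_line: int, end_line: int) -> str:
    """Return lines start_line..end_line (1-indexed, inclusive), newlines kept."""
    return "".join(source.splitlines(keepends=True)[start_line - 1:end_line])


def loose_check(expected_slice: str, content: str) -> bool:
    """The earlier check: true if either side contains the other after stripping."""
    return expected_slice.strip() in content or content.strip() in expected_slice


def strict_check(expected_slice: str, content: str) -> bool:
    """Exact-substring check: the whole slice must appear verbatim in the chunk."""
    return expected_slice in content


if __name__ == "__main__":
    src = "a = 1\nb = 2\nc = 3\n"
    expected = source_slice(src, 1, 3)
    truncated_chunk = "a = 1\nb = 2\n"  # the last line was silently dropped
    print(loose_check(expected, truncated_chunk))   # the loose check still passes
    print(strict_check(expected, truncated_chunk))  # the strict check fails
```

With the loose check, a chunk that lost its last line still matches because the shortened chunk is itself a substring of the expected slice. The strict check only passes when every line in `[start_line, end_line]` is present and in order.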
Brings the new contiguous-slice test in line with the project's ruff line-length, fixing the ``format`` CI check that failed on the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapsed nine near-identical ``test_invalid_*`` cases that each constructed one bad parameter and asserted ``ValueError`` into a single parametrized test. Same coverage, ~20 lines removed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The second usage example reproduced the entire ``strip_docstrings=True`` setup just to show ``meta_fields_to_embed``; collapse it to a single sentence pointing at the embedder option. The first example already demonstrates how stripping moves docstrings into ``meta``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er tests

Merged five small test classes into related host classes so the file holds fewer top-level groupings while keeping every test:
- ``TestClassMetadata`` (2 tests) → ``TestIncludeClassesMeta`` (same topic)
- ``TestThreeDecorators`` (2 tests) → ``TestDecorators`` (same topic)
- ``TestUnitKinds`` (1 test) and ``TestMultipleDocuments`` (2 tests) → ``TestBasicOutput`` (both exercise chunk-level meta)
- ``TestEffectiveLines`` (1 test) → ``TestOversizedFallback`` (both exercise merge-sizing behavior)

No tests added, removed, or weakened; 69 tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Compressed the class docstring (dropped the duplicated General Behavior section, the verbose example-output blocks, and one of two usage examples), shortened the ``__init__`` parameter descriptions, and trimmed multi-line inline comments down to a single line where the extra prose was restating the obvious. Public information preserved: all parameters still documented, behavior contract unchanged, and one runnable usage example kept. Source file 789 -> 612 lines (-177). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ter tests

- Dropped multi-line ``assert ..., f"long message"`` failure strings; pytest's assertion introspection already prints the operands.
- Dropped one-line comments that restated the next assertion.
- Parametrized the three ``test_*_unit_kind_present`` cases (one body, three kinds) and the two ``include_classes`` membership cases (one body, two class names); dropped the subsumed ``test_include_classes_set_for_class_chunks``.
- Inlined a few list-comprehensions in place of ``for``-extend loops.

68 tests pass (down from 69 only because the dropped union test is trivially implied by the two parametrized membership tests). Test file 982 -> 874 lines (-108). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
julian-risch
left a comment
Thank you for opening this PR @MechaCritter! I made some smaller improvements, including extracting a small helper for _emit_class_units and _strip_docstring.
It's a big PR, so I reduced the number of lines of code with a couple of commits to make the code more readable and maintainable.
Thanks again, and please let us know how your project with this new component is going! I'll open a follow-up issue about adding a documentation page for this new component.
Thanks for your review @julian-risch. It was fun contributing :)
Related Issues
Proposed Changes:
Added the `PythonCodeSplitter` class as well as the unit tests. For the motivation, please see the issue above.

This splitter produces documents in which functions of reasonable size fit as a whole, whereas oversized functions are split using the fallback secondary split (to avoid bloating the context upon retrieval because of massive methods). The user can steer when a function counts as "oversized" via the parameters `max_effective_lines`, `expected_chars_per_line` and `oversized_factor`.

The user can strip the docstring and save it in `meta` instead of keeping it in the stored content directly, so that each document can contain more methods.

How did you test it?
Notes for the reviewer
- `expected_chars_per_line` defaults to 45 because, according to my research, a line of Python code has roughly 45 characters on average.
- … `effective lines` because there might be codebases with a single massive dictionary. In that case, `expected_chars_per_line` is more robust.

Checklist
`fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:` and added `!` in case the PR includes breaking changes.
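
For readers who want a concrete picture of the sizing parameters named under Proposed Changes, the following is a rough sketch only. It assumes that "effective lines" means character count divided by `expected_chars_per_line`, and that a unit is oversized once it exceeds `max_effective_lines * oversized_factor`; both the formula and the default values for `max_effective_lines` and `oversized_factor` are assumptions made for illustration, not the splitter's actual implementation. Only the default of 45 for `expected_chars_per_line` comes from the PR description.

```python
def effective_lines(code: str, expected_chars_per_line: int = 45) -> float:
    """Assumed definition: size in characters, expressed in 'typical' lines."""
    if expected_chars_per_line <= 0:
        raise ValueError("expected_chars_per_line must be positive")
    return len(code) / expected_chars_per_line


def is_oversized(
    code: str,
    max_effective_lines: int = 50,    # hypothetical default
    expected_chars_per_line: int = 45,
    oversized_factor: float = 1.5,    # hypothetical default
) -> bool:
    """Assumed rule: oversized when effective lines exceed the scaled budget."""
    budget = max_effective_lines * oversized_factor
    return effective_lines(code, expected_chars_per_line) > budget


if __name__ == "__main__":
    small_function = "def f():\n    return 1\n"
    # A single physical line holding a massive dictionary literal.
    massive_dict = "d = {" + ", ".join(f"'k{i}': {i}" for i in range(1000)) + "}\n"
    print(is_oversized(small_function))
    print(massive_dict.count("\n"), is_oversized(massive_dict))
```

Under these assumptions, a character-based measure flags the one-line dictionary as oversized even though a raw line count would report a single line, which matches the robustness concern raised in the notes for the reviewer.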