Fix/candidate quantity wrongly evaluate #11

Open: lpi-tn wants to merge 5 commits into main from Fix/candidate-quantity-wrongly-evaluate

lpi-tn commented Jun 16, 2026

Collaborator

This pull request refactors the header/footer extraction logic in RefinedDocument to improve maintainability, robustness, and clarity. It removes the unify_list_len helper, introduces new internal methods for candidate and neighbor identification, and adds comprehensive docstrings and type annotations throughout the codebase. Additionally, the tests are updated to reflect these changes and to cover new edge cases.

Major changes include:

Refactoring and Code Organization

  • Removed the unify_list_len function and replaced its usage with an internal _pad_neighbours method within RefinedDocument, leading to more localized and explicit neighbor padding logic. (src/refinedoc/helpers.py, src/refinedoc/refined_document.py, tests/test_helpers.py)
  • Introduced new private methods in RefinedDocument for identifying header/footer candidates (_identify_candidates), local neighbors (_identify_local_neighbours), and for padding neighbor lists (_pad_neighbours), improving code modularity and readability. (src/refinedoc/refined_document.py)
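The padding step described above can be sketched in isolation. This is a minimal illustration, not the PR's implementation: the free function `pad_neighbours`, the `filler` parameter, the enum values, and the choice to pad headers at the end and footers at the front are all assumptions made for the example.

```python
from enum import Enum


class TargetedPart(Enum):
    """Stand-in for the library's enum; the member values are assumed."""

    HEADER = "header"
    FOOTER = "footer"


def pad_neighbours(neighbours, size, targeted_part, filler=""):
    """Pad every neighbour list in place so that each has at least `size` entries.

    Assumption: headers are compared from the top of the page, so they are
    padded at the end; footers are compared from the bottom, so they are
    padded at the front.
    """
    for lines in neighbours:
        missing = size - len(lines)
        if missing <= 0:
            continue  # already long enough; nothing is truncated
        padding = [filler] * missing
        if targeted_part is TargetedPart.HEADER:
            lines.extend(padding)
        else:
            lines[:0] = padding  # insert the padding before the existing lines


headers = [["Title", "Page 1"], ["Title"]]
pad_neighbours(headers, 3, TargetedPart.HEADER)

footers = [["Confidential"], ["Notes", "Confidential"]]
pad_neighbours(footers, 2, TargetedPart.FOOTER)
```

Keeping the padding next to the code that consumes it, as a method rather than a shared helper, means the alignment rule for headers and footers lives in one place.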

Type Annotations and Documentation

  • Added or improved type annotations and comprehensive docstrings for all public and private methods in RefinedDocument, as well as for helper functions, clarifying expected argument and return types. (src/refinedoc/refined_document.py, src/refinedoc/helpers.py)

Robustness and Error Handling

  • Improved error handling by raising NotImplementedError with a clear constant message for unsupported TargetedPart values, and by logging warnings when candidate quantities are too high or when pages are empty. (src/refinedoc/refined_document.py)
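A rough sketch of the error-handling pattern described above, under stated assumptions: `select_candidates`, `UNSUPPORTED_PART_MESSAGE`, the logger name, and the `BODY` member are hypothetical names invented for this example and do not come from the PR.

```python
import logging
from enum import Enum

logger = logging.getLogger("refinedoc_sketch")

# One module-level constant keeps the error text identical at every raise site.
UNSUPPORTED_PART_MESSAGE = "Only HEADER and FOOTER parts are supported."


class TargetedPart(Enum):
    """Stand-in enum; BODY is a hypothetical member used to trigger the error."""

    HEADER = "header"
    FOOTER = "footer"
    BODY = "body"


def select_candidates(page, quantity, targeted_part):
    """Return up to `quantity` lines from the top or bottom of `page`."""
    if targeted_part not in (TargetedPart.HEADER, TargetedPart.FOOTER):
        raise NotImplementedError(UNSUPPORTED_PART_MESSAGE)
    if not page:
        logger.warning("Empty page: no candidates to extract.")
        return []
    if quantity > len(page):
        logger.warning(
            "Candidate quantity %d exceeds page length %d; falling back to %d.",
            quantity, len(page), len(page),
        )
        quantity = len(page)
    if targeted_part is TargetedPart.HEADER:
        return page[:quantity]
    # Slicing from an explicit start index also behaves correctly for quantity == 0.
    return page[len(page) - quantity:]
```

Warnings are used for recoverable situations (short or empty pages), while the exception is reserved for a value the function cannot handle at all.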

Test Suite Updates

  • Removed tests for the deleted unify_list_len function and added new tests to verify candidate quantity fallback behavior and warnings when processing short or empty pages. (tests/test_helpers.py, tests/test_refined_document.py)
  • Added a new setUp method in tests/test_refined_document.py to provide a reusable sample document for tests. (tests/test_refined_document.py)
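The testing approach described above (a shared fixture built in setUp, plus checks on fallback behavior and warnings) might look roughly like the following. The function `header_candidates`, the logger name, and the sample document are hypothetical stand-ins; the PR's real tests exercise RefinedDocument instead.

```python
import logging
import unittest

logger = logging.getLogger("refinedoc_sketch.tests")


def header_candidates(page, quantity):
    """Hypothetical stand-in: first `quantity` lines, clamped to the page length."""
    if quantity > len(page):
        logger.warning(
            "Candidate quantity %d too high for a %d-line page.", quantity, len(page)
        )
        quantity = len(page)
    return page[:quantity]


class CandidateFallbackTest(unittest.TestCase):
    def setUp(self):
        # Rebuilt before every test, so one test cannot leak mutations into another.
        self.document = [
            ["Report", "Intro", "Body", "Page 1"],
            ["Report"],  # short page
            [],  # empty page
        ]

    def test_short_page_falls_back_and_warns(self):
        with self.assertLogs(logger, level="WARNING") as captured:
            result = header_candidates(self.document[1], 3)
        self.assertEqual(result, ["Report"])
        self.assertEqual(len(captured.records), 1)

    def test_empty_page_yields_no_candidates(self):
        with self.assertLogs(logger, level="WARNING"):
            result = header_candidates(self.document[2], 3)
        self.assertEqual(result, [])

    def test_long_page_returns_requested_quantity(self):
        result = header_candidates(self.document[0], 3)
        self.assertEqual(result, ["Report", "Intro", "Body"])
```

Saved as a test module, this can be run with `python -m unittest`. `assertLogs` fails the test when no matching record is emitted, so it checks both that a warning occurred and what it contained.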

Minor Improvements

  • Updated the TargetedPart enum with a docstring to clarify its purpose. (src/refinedoc/enumeration.py)

These changes collectively enhance the clarity, maintainability, and reliability of the document refinement and header/footer extraction process.

Copilot AI left a comment

Contributor

Pull request overview

This pull request refactors RefinedDocument’s header/footer extraction to make candidate selection and neighbour handling more modular, while removing the unify_list_len helper and updating tests accordingly.

Changes:

  • Refactors header/footer extraction by introducing _identify_candidates, _identify_local_neighbours, and _pad_neighbours, and removing unify_list_len.
  • Adds/improves type annotations and docstrings across RefinedDocument and helpers.
  • Updates the test suite to remove unify_list_len tests and add new candidate-quantity/edge-case coverage.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/refinedoc/refined_document.py | Refactors candidate + neighbour logic; adds docs/types; updates error handling/logging. |
| src/refinedoc/helpers.py | Removes unify_list_len; improves type hints for helpers. |
| src/refinedoc/enumeration.py | Adds docstring clarifying TargetedPart. |
| tests/test_helpers.py | Removes tests for deleted unify_list_len. |
| tests/test_refined_document.py | Adds shared fixture and new tests for candidate fallback + candidate identification helpers. |

Comment on lines 119 to 121
if not self._processed_footers or not self._processed_headers:
    self._separate_header_footer(TargetedPart.HEADER)
    self._separate_header_footer(TargetedPart.FOOTER)

Comment on lines +251 to 255
# Pad neighbors to the same size as the local comparison window.
self._pad_neighbours(local_neighbours, standardized_size, targeted_part)

standardized_size = len(max(local_neighbours, key=len))
header_weights = [w for w in generate_weights(standardized_size)]
header_weights = list(generate_weights(standardized_size))
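The two header_weights lines in the excerpt above build the same list in two ways. A small sketch shows that they are equivalent; `sample_weights` is a hypothetical generator written for this example and makes no claim about how generate_weights actually computes its values.

```python
def sample_weights(size):
    """Hypothetical generator yielding one weight per line position."""
    for position in range(size):
        yield 1.0 / (position + 1)


# A comprehension that only forwards each item...
via_comprehension = [w for w in sample_weights(4)]
# ...produces the same list as passing the generator to list().
via_constructor = list(sample_weights(4))
```

Calling `list()` directly states the intent (materialize the generator) without the pass-through loop variable.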

Comment on lines +304 to +308
upper_part = header_footer_candidates[
    min(page_index + 1, len(header_footer_candidates)) : min(
        page_index + self.win, len(header_footer_candidates)
    )
]

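For reference, Python list slicing already clamps out-of-range bounds to the length of the list, so the explicit `min(...)` calls in the excerpt above do not change the result. The sketch below compares both forms; the function names and the sample data are hypothetical, and `win` stands in for `self.win`.

```python
def upper_window_clamped(candidates, page_index, win):
    """Bounds clamped explicitly, mirroring the quoted snippet."""
    return candidates[
        min(page_index + 1, len(candidates)) : min(page_index + win, len(candidates))
    ]


def upper_window_plain(candidates, page_index, win):
    """Same window, relying on the clamping that slicing performs by itself."""
    return candidates[page_index + 1 : page_index + win]


pages = ["p0", "p1", "p2", "p3", "p4"]

middle = upper_window_plain(pages, 1, 3)  # slice [2:4]
near_end = upper_window_plain(pages, 3, 3)  # slice [4:6], clamped to [4:5]
last_page = upper_window_plain(pages, 4, 3)  # slice [5:7], which is empty
```

Note that a slice ending at `page_index + win` contains at most `win - 1` following pages, since the end bound is exclusive.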
Comment on lines +554 to +556
local_big_document = self.big_document
local_big_document[1] = []
rd = RefinedDocument(content=self.big_document)

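A general Python point that the quoted test lines touch on: plain assignment binds a second name to the same list rather than copying it, so a change made through either name is visible through both. The sketch below uses made-up sample data and `copy.deepcopy` from the standard library to show the difference; it is an illustration, not a statement about what the test intends.

```python
import copy

big_document = [["Header", "Body"], ["Header", "More body"]]

# Plain assignment: both names refer to the same list object.
alias = big_document
alias[1] = []
# big_document[1] is now [] as well, because alias is big_document.

original = [["Header", "Body"], ["Header", "More body"]]

# A deep copy duplicates the outer list and every nested page list.
independent = copy.deepcopy(original)
independent[1] = []
# original[1] is left untouched.
```

When a test needs a modified variant of a shared fixture, working on a deep copy keeps the fixture intact for any later assertions.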
3 participants