🐛 Bug fix for exact entities. #80
…the fuzzy matching algorithm.
Reviewer's Guide
Introduces exact-match handling for specific entity labels in the entity disambiguation pipeline by propagating a normalized subclass key from anonymization postprocessing into canonical entity building, so that those labels cluster strictly by exact value instead of fuzzy similarity; also adjusts a notebook to point to a different example document.
Sequence diagram for exact-match handling in entity disambiguation
sequenceDiagram
participant AnonymizationPostprocess
participant FuzzyDisambiguation
participant CanonicalEntities
AnonymizationPostprocess->>AnonymizationPostprocess: process(ent)
AnonymizationPostprocess->>AnonymizationPostprocess: cleaned_text = pattern.sub("", ent.text)
AnonymizationPostprocess->>AnonymizationPostprocess: ent.attrs.aymurai_label_subclass = []
alt label in exact_labels
AnonymizationPostprocess->>AnonymizationPostprocess: flattened_text = re.sub("[^a-zA-Z0-9]", "", cleaned_text)
AnonymizationPostprocess->>AnonymizationPostprocess: ent.attrs.aymurai_label_subclass.append(flattened_text)
end
AnonymizationPostprocess->>AnonymizationPostprocess: ent.attrs.aymurai_alt_text = cleaned_text
FuzzyDisambiguation->>FuzzyDisambiguation: build_canonical_entities(labels, target_labels, threshold)
FuzzyDisambiguation->>FuzzyDisambiguation: grouped.setdefault(aymurai_label, []).append({text, aymurai_label, exact_alias})
loop for each label_type, items in grouped.items()
alt label_type in EXACT_LABELS
FuzzyDisambiguation->>FuzzyDisambiguation: exact_groups.setdefault(exact_alias, []).append(item)
FuzzyDisambiguation->>FuzzyDisambiguation: clusters = list(exact_groups.values())
else
FuzzyDisambiguation->>FuzzyDisambiguation: clusters = _cluster_aliases_with_cdist(items, threshold)
end
FuzzyDisambiguation->>CanonicalEntities: _clusters_to_canonical_entities(clusters)
end
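The branching in the diagram can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from `fuzzy.py`: the `group_items` helper, the `fuzzy_cluster` callback (a stand-in for `_cluster_aliases_with_cdist`), the `PER` label, and the sample items are all hypothetical; only the dictionary keys and the `setdefault` grouping mirror the diagram.

```python
from collections import defaultdict
from typing import Callable

# Subset of the labels listed in the PR, kept short for illustration.
EXACT_LABELS = {"DNI", "CBU", "IP"}

Item = dict[str, str]
Clusters = list[list[Item]]


def group_items(
    items: list[Item],
    fuzzy_cluster: Callable[[list[Item]], Clusters],
) -> dict[str, Clusters]:
    """Cluster items per label: exact labels by alias, all others fuzzily."""
    grouped: dict[str, list[Item]] = defaultdict(list)
    for item in items:
        grouped[item["aymurai_label"]].append(item)

    result: dict[str, Clusters] = {}
    for label_type, label_items in grouped.items():
        if label_type in EXACT_LABELS:
            # One cluster per distinct normalized alias, no similarity score.
            exact_groups: dict[str, list[Item]] = {}
            for item in label_items:
                exact_groups.setdefault(item["exact_alias"], []).append(item)
            result[label_type] = list(exact_groups.values())
        else:
            result[label_type] = fuzzy_cluster(label_items)
    return result


def one_cluster(label_items: list[Item]) -> Clusters:
    """Hypothetical fuzzy step: treat every item as the same entity."""
    return [label_items]


items = [
    {"text": "12.345.678", "aymurai_label": "DNI", "exact_alias": "12345678"},
    {"text": "12345678", "aymurai_label": "DNI", "exact_alias": "12345678"},
    {"text": "12.345.679", "aymurai_label": "DNI", "exact_alias": "12345679"},
    {"text": "Juan Perez", "aymurai_label": "PER", "exact_alias": ""},  # hypothetical label
]
clusters = group_items(items, one_cluster)
```

In this sketch the two DNI values that differ by a single digit land in separate clusters, which is the behavior the exact path is meant to guarantee and which a similarity threshold might not.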
Hey - I've found 2 issues, and left some high level feedback:
- The `exact_labels` set is duplicated in both `fuzzy.py` and `core.py`; consider centralizing this constant in a shared module to avoid divergence and make future updates easier.
- In `anonymization_postprocess/core.py`, `aymurai_label_subclass` is always reset to an empty list; verify whether you should preserve any existing subclasses or guard against overwriting previously set values.
- The notebook change from `documents[14]` to `documents[5]` looks like a local experiment tweak; confirm this is the intended default behavior and not a temporary debugging choice.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `exact_labels` set is duplicated in both `fuzzy.py` and `core.py`; consider centralizing this constant in a shared module to avoid divergence and make future updates easier.
- In `anonymization_postprocess/core.py`, `aymurai_label_subclass` is always reset to an empty list; verify whether you should preserve any existing subclasses or guard against overwriting previously set values.
- The notebook change from `documents[14]` to `documents[5]` looks like a local experiment tweak; confirm this is the intended default behavior and not a temporary debugging choice.
## Individual Comments
### Comment 1
<location path="aymurai/utils/entity_disambiguation/fuzzy.py" line_range="10-18" />
<code_context>
from aymurai.meta.api_interfaces import DocLabel
from aymurai.meta.entities import CanonicalEntity
+EXACT_LABELS = {
+ "DNI",
+ "CUIT_CUIL",
+ "TELEFONO",
+ "PATENTE_DOMINIO",
+ "IP",
+ "NUM_CAJA_AHORRO",
+ "CBU",
+ "NUM_MATRICULA",
+}
+
</code_context>
<issue_to_address>
**suggestion:** Avoid duplicating the exact-label set in multiple modules by centralizing it
This set also exists here as `EXACT_LABELS` and in `anonymization_postprocess/core.py` as `exact_labels`. Please move it to a shared constants module and import it in both places so there’s a single source of truth and no risk of the two lists drifting out of sync.
Suggested implementation:
```python
from aymurai.meta.api_interfaces import DocLabel
from aymurai.meta.entities import CanonicalEntity
from aymurai.meta.constants import EXACT_LABELS
```
1. Create (or extend) a shared constants module, for example `aymurai/meta/constants.py`, and move the set definition there:
```python
EXACT_LABELS = {
    "DNI",
    "CUIT_CUIL",
    "TELEFONO",
    "PATENTE_DOMINIO",
    "IP",
    "NUM_CAJA_AHORRO",
    "CBU",
    "NUM_MATRICULA",
}
```
2. In `anonymization_postprocess/core.py`, replace the local `exact_labels` definition with an import from the same constants module, e.g.:
```python
from aymurai.meta.constants import EXACT_LABELS as exact_labels
```
(or adjust naming/import style to match existing conventions in that file).
3. Ensure `aymurai/meta/constants.py` is part of the package (has `__init__.py` as needed) and update any relevant `__all__` if your project uses it.
</issue_to_address>
### Comment 2
<location path="aymurai/transforms/anonymization_postprocess/core.py" line_range="60-64" />
<code_context>
+ "NUM_MATRICULA",
+ }
+
+ ent["attrs"]["aymurai_label_subclass"] = []
+
+ if label in exact_labels:
+ flattened_text = re.sub(r"[^a-zA-Z0-9]", "", cleaned_text)
+ ent["attrs"]["aymurai_label_subclass"].append(flattened_text)
+
# Update the entity's alt text and indices
</code_context>
<issue_to_address>
**issue (bug_risk):** Re-initializing `aymurai_label_subclass` may unintentionally discard previous subclass information
Unconditionally assigning `ent["attrs"]["aymurai_label_subclass"] = []` clears any existing data in this field before you append the new value. If earlier steps in the pipeline set this attribute (now or in the future), this could cause data loss. Consider only initializing when absent (e.g., via `setdefault`/`get`) or otherwise making this logic additive rather than destructive.
</issue_to_address>
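One way to make the subclass update additive, as Comment 2 suggests, is `dict.setdefault`. The following is only a sketch: the `add_exact_subclass` helper and the `ent["label"]` key are assumptions made for illustration (the PR shows `label` as a local variable, not where it comes from), and the label set is shortened.

```python
import re

EXACT_LABELS = {"DNI", "CBU", "IP"}  # subset of the PR's set, for illustration


def add_exact_subclass(ent: dict, cleaned_text: str) -> dict:
    """Append the flattened text without discarding existing subclasses."""
    # setdefault creates the list only when the key is absent,
    # so values written by earlier pipeline steps survive.
    subclasses = ent["attrs"].setdefault("aymurai_label_subclass", [])
    if ent.get("label") in EXACT_LABELS:  # "label" key is an assumption
        flattened_text = re.sub(r"[^a-zA-Z0-9]", "", cleaned_text)
        if flattened_text not in subclasses:
            subclasses.append(flattened_text)
    return ent


ent = {"label": "DNI", "attrs": {"aymurai_label_subclass": ["existing"]}}
add_exact_subclass(ent, "12.345.678")
add_exact_subclass(ent, "12.345.678")  # second call adds nothing new

fresh = {"label": "DNI", "attrs": {}}
add_exact_subclass(fresh, "12.345.678")
```

Whether preserving earlier values is actually desired depends on how downstream code reads `aymurai_label_subclass`; if it expects exactly one entry for exact labels, the unconditional reset in the PR may be intentional.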
Pull request overview
This PR adjusts entity disambiguation/anonymization so certain identifier-like labels (e.g., DNI/CBU/IP) are treated as exact identifiers (no fuzzy clustering), using a normalized “exact alias” derived from label subclass metadata.
Changes:
- Add an `EXACT_LABELS` path in canonical-entity building to group exact-identifier labels by a normalized alias instead of fuzzy clustering.
- Update anonymization postprocessing to store a normalized subclass value for exact-identifier labels to support exact grouping.
- Update an experimental notebook to process a different sample document.
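To make the "normalized alias" concrete, here is a small sketch of the flattening step using the same regular expression that appears in the PR. The `exact_alias` function name and the sample values are hypothetical.

```python
import re


def exact_alias(text: str) -> str:
    """Drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^a-zA-Z0-9]", "", text)


# Differently formatted spellings of one identifier share a single key.
variants = ["20-12345678-9", "20 12345678 9", "20123456789"]
keys = {exact_alias(v) for v in variants}

# Case is preserved, so these two spellings stay distinct.
lower_key = exact_alias("ab 123 cd")
upper_key = exact_alias("AB 123 CD")

# Removing separators can also merge distinct dotted values.
ip_a = exact_alias("1.23.4.5")
ip_b = exact_alias("12.3.4.5")
```

Two properties of this regex may be worth checking against the intended behavior: it keeps letter case, so differently cased spellings would not group together, and for dotted values such as IP addresses it removes the separators, so distinct addresses can flatten to the same key.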
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| notebooks/experiments/entity-disambiguation/10-anonymize-document-render-policy.ipynb | Changes which document sample index is processed in the experiment. |
| aymurai/utils/entity_disambiguation/fuzzy.py | Introduces exact-identifier grouping logic during canonical entity construction. |
| aymurai/transforms/anonymization_postprocess/core.py | Records a normalized subclass value for exact-identifier labels during entity cleaning. |
879309c8 Feat/entity manager mention feedback (#81)
4d2de106 Fix/responsive home layout (#80)
986e68d2 Fix/homogenize file check ui (#77)
046f8ab9 fix(file-annotator): fix upward autoscroll on search previous navigation (#76)
2ecf75dc feat(dependencies): add dnd-kit packages for drag-and-drop functionality
git-subtree-dir: frontend
git-subtree-split: 879309c841d8072babc4d06f1686d11cf8cbd03f
Summary by Sourcery
Handle certain entity labels as exact identifiers during disambiguation and anonymization, and adjust the experimental notebook document selection accordingly.
Bug Fixes:
Enhancements: