FD-324: real-chat LCC measurement + failure-mode classification#123
Conversation
…cript Deterministic post-extraction repair sets Person.parents from birth events when the LLM fails to forward-reference the PairBond ID. ParentChild F1 0.366→0.815 N=3 avg (+123%), LCC 51%→89.5% (target ≥80% met), no regression. Ships connectivity_check.py script and 5 unit tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- connectivity_check.py: add --accumulate and --dump-disconnected modes for reproducible real-chat LCC measurement (accumulates discussions in order, commits PDP to DiagramData between discussions) - PROMPT_ENGINEERING_LOG.md: document real-chat LCC baseline/fixed results, failure-mode classification for both diagrams, F1 no-regression (2 runs) Real-chat results with fix: Patrick 1924: 23.1% → 30.0% (content-bounded, not fixable) Guillermo 1589: 84.6% → 88.5% (AC2 met) Synthetic avg: 79.1% → 86.2% (AC2 met) F1: Agg 0.654/0.633 (2 runs, within noise vs 0.651 baseline) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a deterministic repair function, infer_parents_from_birth_events, to resolve a forward-referencing issue where Person.parents fields were left null because pair bond IDs were not yet computed. This change significantly improves the ParentChild F1 score and family-tree connectivity (LCC %). Additionally, a new connectivity_check.py script is added to measure LCC percentages across various scenarios, along with comprehensive tests and updated documentation. Feedback on the changes suggests optimizing dyad normalization in the repair function to avoid sorting overhead and improve type safety, as well as correcting a comment in the connectivity check script that incorrectly describes a DFS implementation as BFS.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| bond_by_dyad: dict[tuple[int, int], int] = {} | ||
| for pb in deltas.pair_bonds: | ||
| if pb.id is not None and pb.person_a is not None and pb.person_b is not None: | ||
| bond_by_dyad[tuple(sorted([pb.person_a, pb.person_b]))] = pb.id # type: ignore[arg-type] | ||
|
|
||
| person_map = {p.id: p for p in deltas.people if p.id is not None} | ||
|
|
||
| for event in deltas.events: | ||
| if event.kind not in (EventKind.Birth, EventKind.Adopted): | ||
| continue | ||
| if event.child is None or event.person is None or event.spouse is None: | ||
| continue | ||
| dyad = tuple(sorted([event.person, event.spouse])) | ||
| pb_id = bond_by_dyad.get(dyad) # type: ignore[arg-type] |
There was a problem hiding this comment.
The current implementation uses sorted() and tuple() to normalize the person dyads, which introduces unnecessary list creation and sorting overhead. Additionally, it requires # type: ignore[arg-type] to bypass type-checking warnings.
We can optimize this by using a simple conditional expression (a, b) if a < b else (b, a) to construct the sorted tuple. This is faster, avoids memory allocation for lists, and is fully type-safe without needing any type ignores.
| bond_by_dyad: dict[tuple[int, int], int] = {} | |
| for pb in deltas.pair_bonds: | |
| if pb.id is not None and pb.person_a is not None and pb.person_b is not None: | |
| bond_by_dyad[tuple(sorted([pb.person_a, pb.person_b]))] = pb.id # type: ignore[arg-type] | |
| person_map = {p.id: p for p in deltas.people if p.id is not None} | |
| for event in deltas.events: | |
| if event.kind not in (EventKind.Birth, EventKind.Adopted): | |
| continue | |
| if event.child is None or event.person is None or event.spouse is None: | |
| continue | |
| dyad = tuple(sorted([event.person, event.spouse])) | |
| pb_id = bond_by_dyad.get(dyad) # type: ignore[arg-type] | |
| bond_by_dyad: dict[tuple[int, int], int] = {} | |
| for pb in deltas.pair_bonds: | |
| if pb.id is not None and pb.person_a is not None and pb.person_b is not None: | |
| a, b = pb.person_a, pb.person_b | |
| bond_by_dyad[(a, b) if a < b else (b, a)] = pb.id | |
| person_map = {p.id: p for p in deltas.people if p.id is not None} | |
| for event in deltas.events: | |
| if event.kind not in (EventKind.Birth, EventKind.Adopted): | |
| continue | |
| if event.child is None or event.person is None or event.spouse is None: | |
| continue | |
| p_a, p_b = event.person, event.spouse | |
| dyad = (p_a, p_b) if p_a < p_b else (p_b, p_a) | |
| pb_id = bond_by_dyad.get(dyad) |
| # BFS connected components | ||
| visited: set[int] = set() | ||
| component_sizes: list[int] = [] | ||
| for start in nodes: | ||
| if start in visited: | ||
| continue | ||
| queue = [start] | ||
| visited.add(start) | ||
| size = 0 | ||
| while queue: | ||
| cur = queue.pop() | ||
| size += 1 | ||
| for nb in adj[cur]: | ||
| if nb not in visited: | ||
| visited.add(nb) | ||
| queue.append(nb) | ||
| component_sizes.append(size) |
There was a problem hiding this comment.
The comment states # BFS connected components, but the implementation uses queue.pop() (without an index), which removes and returns the last element of the list. This implements a Last-In-First-Out (LIFO) stack, resulting in a Depth-First Search (DFS) rather than a Breadth-First Search (BFS).
While both DFS and BFS are perfectly correct for finding connected components and their sizes, we should update the comment to accurately reflect the DFS implementation, or use collections.deque with popleft() if BFS is specifically desired.
| # BFS connected components | |
| visited: set[int] = set() | |
| component_sizes: list[int] = [] | |
| for start in nodes: | |
| if start in visited: | |
| continue | |
| queue = [start] | |
| visited.add(start) | |
| size = 0 | |
| while queue: | |
| cur = queue.pop() | |
| size += 1 | |
| for nb in adj[cur]: | |
| if nb not in visited: | |
| visited.add(nb) | |
| queue.append(nb) | |
| component_sizes.append(size) | |
| # DFS connected components | |
| visited: set[int] = set() | |
| component_sizes: list[int] = [] | |
| for start in nodes: | |
| if start in visited: | |
| continue | |
| queue = [start] | |
| visited.add(start) | |
| size = 0 | |
| while queue: | |
| cur = queue.pop() | |
| size += 1 | |
| for nb in adj[cur]: | |
| if nb not in visited: | |
| visited.add(nb) | |
| queue.append(nb) | |
| component_sizes.append(size) |
lcc_percent keeps the User as a connecting node (the proband hub) and excludes only User+Assistant from the count, per the agreed AC#2 definition; the Assistant is dropped from the graph entirely. Deleting the User node fragmented correctly- extracted families (everyone connects through the proband). Adds 4 tests; fixes a DFS-mislabeled-as-BFS comment (Gemini review). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Disproved by transcript evidence: disc 60 explicitly states the orphans' parents (Jim+Sheila → Anthony/Joseph/Julia) and the user demanded the link, yet extraction sets ~0-3 parents. The gap is architectural (single-shot under-extraction of a 200+ statement conversation + no cross-session parent back-fill), addressable via the FD-319 cursor/windowing re-extraction — not a content ceiling. Four prompt-directive variants left Patrick within noise. Original wording struck through, not deleted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Conflicts: # btcopilot/tests/personal/test_pdp.py # btcopilot/training/connectivity_check.py # doc/PROMPT_ENGINEERING_LOG.md
Summary
--accumulateand--dump-disconnectedmodes toconnectivity_check.pyfor reproducible real-chat LCC measurement (accumulates discussions in order, committing PDP to DiagramData between discussions — mirrors live diagram growth)PROMPT_ENGINEERING_LOG.mdResults
F1 no-regression (2 runs, 6 GT synthetic): Agg 0.654/0.633 — within known ±0.05–0.10 noise vs 0.651 baseline.
Failure-mode classification (AC4)
Patrick's 30% LCC is content-bounded: the source conversation does not provide enough relationship structure to reach 80% regardless of prompt engineering.
Reproducible verification commands
🤖 Generated with Claude Code