`vdh.extract_frames_by_mode` should also return "source" information

### New Feature Summary

`vdh.extract_frames_by_mode()` (added in #373, released in 1.3.0) returns a list of frames (`np.ndarray` or `PIL.Image`) but discards the source TimePoint (TP) IDs in the process. Apps that consume the results have no way to know *which* TP inside the input TimeFrame (TF) produced each frame.

This is causing provenance regressions in downstream apps. For example, in https://github.com/clamsproject/app-smolvlm2-captioner/issues/7, Owen reported that the captioner used to emit `Alignment` annotations whose `source` pointed to the per-frame TP, but after switching to `extract_frames_by_mode()` there is no way to record the time points in the `Alignment` annotations. This is particularly bad for `mode=reps`: when a TF has multiple representatives, the captioner emits multiple captions for that TF, and none of them can be grounded any finer than the TF itself. The same problem will hit any LLM-based app that processes TF inputs (OCR, captioning, classification-via-LLM, etc.).

## Proposed solution

Add a new function alongside the existing `extract_frames_by_mode()`:

```python
def extract_frames_by_mode_with_sources(...) -> List[Tuple[Frame, Union[str, int]]]:
    """
    Same behavior as extract_frames_by_mode(), but returns each frame paired
    with a grounding point: the long_id (str) of the source TP annotation if
    one is available, otherwise the sampled timestamp (int, in milliseconds)
    that the frame was extracted from. Either way the caller has a way to
    anchor the frame to a specific point in the source media.
    """
```
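To illustrate what the mixed return type means for callers, here is a minimal sketch of the consuming side. It assumes only the proposed return shape (a list of `(frame, source)` pairs); the helper name `describe_grounding` and the record keys are hypothetical and not part of any existing API.

```python
from typing import Any, Dict, List, Tuple, Union

# Grounding point as proposed: TP long_id (str) or timestamp in ms (int).
Source = Union[str, int]


def describe_grounding(pairs: List[Tuple[Any, Source]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: turn (frame, source) pairs into grounding records.

    A str source is treated as the long_id of an existing TP annotation.
    An int source is treated as a timestamp in milliseconds, for which the
    app would have to create its own anchor.
    """
    records: List[Dict[str, Any]] = []
    for _frame, source in pairs:
        if isinstance(source, str):
            records.append({"kind": "timepoint_id", "value": source})
        else:
            records.append({"kind": "timestamp_ms", "value": source})
    return records
```

An app would branch the same way when deciding whether an `Alignment` can point at an existing TP or whether a new anchor has to be created first.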

Per-mode behavior:
- `representatives` mode: every frame paired with its source TP long_id (always `str`, since this mode skips TFs without representatives)
- `single` mode: paired with the middle representative's long_id (`str`) when representatives exist; paired with the midpoint timestamp in ms (`int`) when falling back to the start/end interval
- `all` mode: each frame paired with its source TP long_id (`str`) when `targets` are present; paired with the sampled timestamp in ms (`int`) for each frame when no targets exist and stream-rate sampling is used
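The per-mode rules above can be sketched as a small pure function. This is only an illustration under stated assumptions: `pick_sources` and its parameters are hypothetical, the "middle" representative is assumed to be the one at index `len // 2`, and stream-rate sampling is approximated by a fixed step in milliseconds rather than by the actual sampling logic in `vdh`.

```python
from typing import List, Sequence, Union

# Grounding point as proposed: TP long_id (str) or timestamp in ms (int).
Source = Union[str, int]


def pick_sources(
    mode: str,
    start_ms: int,
    end_ms: int,
    representatives: Sequence[str] = (),
    targets: Sequence[str] = (),
    sample_every_ms: int = 1000,  # assumption: fixed step stands in for stream-rate sampling
) -> List[Source]:
    """Hypothetical helper: grounding points for the frames taken from one TF."""
    if mode == "representatives":
        # A TF without representatives yields nothing, so only str IDs come back.
        return list(representatives)
    if mode == "single":
        if representatives:
            # Assumption: "middle" means the element at index len // 2.
            return [representatives[len(representatives) // 2]]
        # Fallback: midpoint of the start/end interval, in ms.
        return [(start_ms + end_ms) // 2]
    if mode == "all":
        if targets:
            return list(targets)
        # No targets: one timestamp per sampled frame.
        return list(range(start_ms, end_ms, sample_every_ms))
    raise ValueError(f"unknown mode: {mode!r}")
```

Note that the `str` vs. `int` outcome depends on the data in each TF and not only on the mode, so a single call over several TFs in `single` or `all` mode may return a mix of both types.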

The existing `extract_frames_by_mode()` stays unchanged. Apps that need provenance migrate to the new function; apps that don't care keep using the existing one. No need to bump the major version.


### Related

## Implicit assumption to note

This proposal assumes that values in a TF's `representatives` (and `targets`) properties are always TP annotation IDs. That assumption currently holds only by convention: `representatives` is not a formally defined property on TF but an "additional property". The formalization is being discussed at https://github.com/clamsproject/clams-vocabulary/issues/12. The new helper depends on this convention being preserved in whatever the formal definition ends up being.
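Until the property is formalized, the new helper could guard against the assumption being violated. The sketch below is hypothetical throughout: `check_timepoint_ids` is not an existing function, and a plain dict from annotation ID to a short type label (`"TimePoint"`, `"TimeFrame"`) stands in for whatever lookup the real implementation would use.

```python
from typing import Dict, List, Sequence


def check_timepoint_ids(ids: Sequence[str], type_by_id: Dict[str, str]) -> List[str]:
    """Hypothetical guard: every ID must resolve to a TimePoint annotation.

    `type_by_id` is a stand-in index (annotation ID -> short type label);
    it is an assumption made for this sketch, not a real MMIF structure.
    """
    bad = [i for i in ids if type_by_id.get(i) != "TimePoint"]
    if bad:
        raise ValueError(f"not TimePoint annotation IDs: {bad}")
    return list(ids)
```

Failing loudly here would surface a broken convention at extraction time instead of producing `Alignment` annotations that point at the wrong kind of annotation.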


### Alternatives

_No response_

### Additional context

- https://github.com/clamsproject/app-smolvlm2-captioner/issues/7 — concrete regression motivating this request
- https://github.com/clamsproject/clams-vocabulary/issues/12 — formalization of `representatives`/`targets`

