vdh.extract_frames_by_mode should also return "source" information #384

@keighrim

New Feature Summary

vdh.extract_frames_by_mode() (added via #373, released in 1.3.0) returns a list of frames (np.ndarray or PIL.Image) but discards the source TimePoint (TP) IDs in the process. Apps that consume the results have no way to know which TP inside the input TimeFrame (TF) produced each frame.

This is causing provenance regressions in downstream apps. For example, in clamsproject/app-smolvlm2-captioner#7, Owen reported that the captioner used to emit Alignment annotations whose source pointed to the per-frame TP, but after switching to extract_frames_by_mode() there is no way to record the time points in the Alignment annotations. This is particularly bad for mode=reps: when multiple representatives exist, the captioner emits multiple captions per TF without finer grounding. The same problem will hit any LLM-based app processing TF inputs (OCR, captioning, classification-via-LLM, etc.).

Proposed solution

Add a new function alongside the existing extract_frames_by_mode():

def extract_frames_by_mode_with_sources(...) -> List[Tuple[Frame, Union[str, int]]]:
    """
    Same behavior as extract_frames_by_mode(), but returns each frame paired
    with a grounding point: the long_id (str) of the source TP annotation if
    one is available, otherwise the sampled timestamp (int, in milliseconds)
    that the frame was extracted from. Either way the caller has a way to
    anchor the frame to a specific point in the source media.
    """

Per-mode behavior:

  • representatives mode: every frame is paired with its source TP long_id (always str, since this mode skips TFs without representatives)
  • single mode: paired with the middle representative's long_id (str) when representatives exist; paired with the midpoint timestamp in ms (int) when falling back to the start/end interval
  • all mode: each frame paired with its source TP long_id (str) when targets are present; paired with the sampled timestamp in ms (int) for each frame when no targets exist and stream-rate sampling is used
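The per-mode rules above can be sketched on plain data, without touching any video. This is only an illustration of the pairing logic, not the proposed implementation: TimeFrameStub, grounding_points, the mode strings, the fixed step_ms sampling interval (standing in for stream-rate sampling), and the choice of len // 2 as the "middle" representative are all assumptions made for the example, and the IDs are made up.

```python
from dataclasses import dataclass, field
from typing import List, Union

# A grounding point is either a TP long_id (str) or a timestamp in ms (int).
GroundingPoint = Union[str, int]


@dataclass
class TimeFrameStub:
    """Hypothetical stand-in for a TimeFrame; not the mmif-python class."""
    start_ms: int
    end_ms: int
    representatives: List[str] = field(default_factory=list)
    targets: List[str] = field(default_factory=list)


def grounding_points(tf: TimeFrameStub, mode: str, step_ms: int = 1000) -> List[GroundingPoint]:
    """Return the grounding point for each frame that would be extracted."""
    if mode == "representatives":
        # A TF without representatives yields nothing, so results are always str.
        return list(tf.representatives)
    if mode == "single":
        if tf.representatives:
            # Assumption: "middle" means index len // 2.
            return [tf.representatives[len(tf.representatives) // 2]]
        # Fallback: midpoint of the start/end interval, in ms.
        return [(tf.start_ms + tf.end_ms) // 2]
    if mode == "all":
        if tf.targets:
            return list(tf.targets)
        # Assumption: a fixed step stands in for stream-rate sampling.
        return list(range(tf.start_ms, tf.end_ms, step_ms))
    raise ValueError(f"unknown mode: {mode}")


with_reps = TimeFrameStub(0, 4000, representatives=["v_1:tp_3", "v_1:tp_5", "v_1:tp_7"])
bare = TimeFrameStub(1000, 3000)
print(grounding_points(with_reps, "single"))  # a TP long_id (str)
print(grounding_points(bare, "single"))       # a midpoint timestamp (int)
```

The point of the sketch is that callers can see a str or an int from the same call in single and all modes, so they need to branch on the type.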

The existing extract_frames_by_mode() stays unchanged. Apps that need provenance migrate to the new function; apps that don't care keep using the existing one. No need to bump the major version.
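On the consuming side, an app that migrates would need to handle both kinds of grounding point. A minimal sketch of that branching follows, assuming the proposed function returns (frame, grounding) pairs as described above. alignment_plan and the dict keys are hypothetical names for illustration only; a real app would build Alignment (and, for the int case, possibly new TimePoint) annotations with its own MMIF code.

```python
from typing import Any, Dict, List, Optional, Tuple, Union

GroundingPoint = Union[str, int]


def alignment_plan(
    pairs: List[Tuple[Any, GroundingPoint]],
    caption_ids: List[str],
) -> List[Dict[str, Optional[Union[str, int]]]]:
    """Decide, per frame, what each Alignment's source should be.

    str grounding: reuse the existing TP annotation as the source.
    int grounding: no TP exists, so record the timestamp (ms) at which
    the app would have to create a new TP to serve as the source.
    """
    plan = []
    for (_frame, ground), caption_id in zip(pairs, caption_ids):
        if isinstance(ground, str):
            plan.append({"source": ground, "target": caption_id, "new_timepoint_ms": None})
        else:
            plan.append({"source": None, "target": caption_id, "new_timepoint_ms": ground})
    return plan


# Placeholder strings stand in for the actual frame objects.
pairs = [("frame-a", "v_1:tp_5"), ("frame-b", 2000)]
for entry in alignment_plan(pairs, ["v_2:td_1", "v_2:td_2"]):
    print(entry)

# Apps that do not need provenance can still drop the grounding half:
frames_only = [frame for frame, _ in pairs]
```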

Related

Implicit assumption to note

This proposal assumes that values in a TF's representatives (and targets) properties are always TP annotation IDs. That assumption currently holds only by convention: representatives is not a formally defined property on TF but an "additional property". The formalization is being discussed at clamsproject/clams-vocabulary#12. The new helper depends on this convention being preserved in whatever the formal definition ends up being.
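Until the property is formalized, the helper (or its callers) could check the assumption defensively. The sketch below shows one way to do that on plain data: non_timepoint_refs and the id-to-type mapping are hypothetical, and the short type name "TimePoint" stands in for whatever type identifier the real annotations carry.

```python
from typing import Dict, List


def non_timepoint_refs(ref_ids: List[str], type_by_id: Dict[str, str]) -> List[str]:
    """Return the referenced IDs that are missing or not TimePoint annotations.

    type_by_id is a hypothetical index from annotation long_id to a short
    type name; building it from a real MMIF file is left to the caller.
    """
    return [ref for ref in ref_ids if type_by_id.get(ref) != "TimePoint"]


index = {"v_1:tp_1": "TimePoint", "v_1:tf_1": "TimeFrame"}
offending = non_timepoint_refs(["v_1:tp_1", "v_1:tf_1", "v_1:tp_9"], index)
if offending:
    print("representatives not pointing at TimePoints:", offending)
```

Whether such a violation should raise, warn, or silently fall back to a timestamp is a design choice this issue does not settle.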
