New Feature Summary
vdh.extract_frames_by_mode() (added in #373, released in 1.3.0) returns a list of frames (np.ndarray or PIL.Image) but discards the source TimePoint (TP) IDs in the process. Apps that consume the result have no way to know which TP inside the input TimeFrame (TF) produced each frame.
This is causing provenance regressions in downstream apps. For example, in clamsproject/app-smolvlm2-captioner#7, Owen reported that the captioner used to emit Alignment annotations whose source pointed to the per-frame TP, but after switching to extract_frames_by_mode() there is no way to record the time points in the Alignment annotations. This is particularly bad for mode=reps: when multiple representatives exist, the captioner emits multiple captions per TF with no finer grounding. The same problem will hit any LLM-based app that processes TF inputs (OCR, captioning, classification-via-LLM, etc.).
Proposed solution
Add a new function alongside the existing extract_frames_by_mode():
def extract_frames_by_mode_with_sources(...) -> List[Tuple[Frame, Union[str, int]]]:
    """
    Same behavior as extract_frames_by_mode(), but returns each frame paired
    with a grounding point: the long_id (str) of the source TP annotation if
    one is available, otherwise the sampled timestamp (int, in milliseconds)
    that the frame was extracted from. Either way the caller has a way to
    anchor the frame to a specific point in the source media.
    """
Per-mode behavior:
representatives mode: every frame is paired with its source TP long_id (always str, since this mode skips TFs without representatives)
single mode: the frame is paired with the middle representative's long_id (str) when representatives exist, or with the midpoint timestamp in ms (int) when falling back to the start/end interval
all mode: each frame is paired with its source TP long_id (str) when targets are present, or with its sampled timestamp in ms (int) when no targets exist and stream-rate sampling is used
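The per-mode rules above can be sketched without any MMIF dependency. This is only an illustration of the pairing logic, not the proposed implementation: the function name grounding_points, the mode strings, the step_ms parameter (a stand-in for stream-rate sampling), and the choice of index len // 2 as the "middle" representative are all assumptions made for the example.

```python
from typing import List, Sequence, Union

# A grounding point is either a TP long_id (str) or a timestamp in ms (int).
GroundingPoint = Union[str, int]


def grounding_points(
    mode: str,
    representatives: Sequence[str],
    targets: Sequence[str],
    start_ms: int,
    end_ms: int,
    step_ms: int = 1000,  # hypothetical stand-in for stream-rate sampling
) -> List[GroundingPoint]:
    """Return one grounding point per frame that the given mode would extract."""
    if mode == "representatives":
        # A TF without representatives yields nothing, so every item is a str.
        return list(representatives)
    if mode == "single":
        if representatives:
            # Assumed tie-breaking: index len // 2 counts as the middle.
            return [representatives[len(representatives) // 2]]
        # Fallback: midpoint of the start/end interval, in ms.
        return [(start_ms + end_ms) // 2]
    if mode == "all":
        if targets:
            return list(targets)
        # Fallback: evenly spaced timestamps across the interval, in ms.
        return list(range(start_ms, end_ms, step_ms))
    raise ValueError(f"unknown mode: {mode!r}")
```

In the actual helper each grounding point would be returned together with its frame; the sketch leaves the frames out so that the str/int split per mode is easy to see.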
The existing extract_frames_by_mode() stays unchanged. Apps that need provenance migrate to the new function; apps that don't care keep using the existing one. No need to bump the major version.
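On the consuming side, an app migrating to the new function would need to branch on the type of the grounding point. The sketch below assumes the (frame, source) pair shape from the proposed signature; the plan_alignments name and the dictionary layout are hypothetical and only show the dispatch, not how Alignment annotations are actually created.

```python
from typing import Any, Dict, List, Tuple, Union


def plan_alignments(
    pairs: List[Tuple[Any, Union[str, int]]],
) -> List[Dict[str, Any]]:
    """Decide, per frame, how its output could be anchored (hypothetical layout)."""
    plans = []
    for index, (_frame, source) in enumerate(pairs):
        if isinstance(source, str):
            # An existing TP long_id: align directly to that annotation.
            plans.append({"frame_index": index, "source": source, "new_timepoint_ms": None})
        else:
            # A raw timestamp in ms: the app would first create a TP at this time.
            plans.append({"frame_index": index, "source": None, "new_timepoint_ms": source})
    return plans
```

Keeping the two cases explicit lets an app decide for itself whether to create new TP annotations for timestamp-only frames or to fall back to TF-level grounding.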
Related
Implicit assumption to note
This proposal assumes that the values in a TF's representatives (and targets) properties are always TP annotation IDs. That assumption currently holds only by convention: representatives is not a formally defined property on TF but an "additional property". The formalization is being discussed at clamsproject/clams-vocabulary#12. The new helper depends on this convention being preserved in whatever the formal definition ends up being.
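Until the convention is formalized, a helper relying on it could check the assumption defensively. The following is a minimal sketch over plain dictionaries; the function name, the ID strings, and the simplified type labels ("TimePoint", "TimeFrame") are illustrative assumptions and do not reflect how MMIF annotations are actually looked up.

```python
from typing import Dict, Iterable, List


def unresolved_timepoint_refs(
    ref_ids: Iterable[str],
    annotation_types: Dict[str, str],
) -> List[str]:
    """Return the referenced IDs that do not resolve to a TimePoint annotation."""
    return [
        ref_id
        for ref_id in ref_ids
        # A missing ID and an ID of another type both violate the convention.
        if annotation_types.get(ref_id) != "TimePoint"
    ]
```

An empty result means every value in representatives (or targets) points at a TP, which is the precondition for returning a str long_id as the grounding point.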
Alternatives
No response
Additional context
Related issues:
clamsproject/app-smolvlm2-captioner#7: concrete regression motivating this request
clamsproject/clams-vocabulary#12 (Alignment type needs some improvement along with definition of "anchors"): formalization of representatives/targets