
Abnormal Inference Time and Repetitive Summary with Efficient-Large-Model/LongVILA-R1-7B on Specific Video Chunk #270

@mobassir94

When using LongVILA-R1 for video summarization, I found that one video chunk took an abnormally long time to process and produced a very long summary with heavy repetition.

Model: LongVILA-R1 (Efficient-Large-Model/LongVILA-R1-7B)

Input Prompt: "Concise summary of the video."

Video: A video containing a traffic accident scene.

Length: 80 seconds

Resolution: 1920x1080

FPS: 20

Chunking Settings:

Chunk Size: 10 seconds

This produced 8 chunks, each 10 seconds long.
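For reference, the chunk boundaries implied by these settings can be sketched as below. This is only an illustration of the arithmetic, not my actual splitting code: `chunk_ranges` is a hypothetical helper, and it assumes the chunks are consecutive, non-overlapping, and numbered from 1. The frame count is the number of source frames per chunk at 20 FPS; the model may sample fewer.

```python
def chunk_ranges(duration_s: float, chunk_s: float) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds for consecutive fixed-size chunks.

    Hypothetical helper for illustration; the last chunk is shortened
    if the duration is not a multiple of the chunk size.
    """
    ranges = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        ranges.append((start, end))
        start = end
    return ranges


ranges = chunk_ranges(80, 10)   # 80 s video, 10 s chunks
frames_per_chunk = 10 * 20      # 10 s at 20 FPS (source frames, before any sampling)

print(len(ranges))        # 8
print(ranges[2])          # (20.0, 30.0), i.e. Chunk 3 under 1-based numbering
print(frames_per_chunk)   # 200
```

Under these assumptions, Chunk 3 covers seconds 20 to 30 of the video.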

Observed Behavior:

1. Chunk 3 took 523.76 seconds of inference time, whereas the other chunks averaged around 6 seconds.

2. The summary generated for Chunk 3 was excessively long and contained many repeated sentences, so it was not the concise summary the prompt asked for.

This suggests that the model either gets stuck in a repetitive generation loop or encounters content in this chunk that triggers degenerate output, which would explain both the long inference time and the repeated sentences.
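To quantify the repetition, a rough check like the following can be run on each chunk's summary. It is a minimal sketch: `split_sentences`, `repeated_sentences`, and `repetition_ratio` are hypothetical helpers, the sentence splitting is a naive regex, and the sample text is made up for illustration (it is not taken from summary.txt).

```python
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Naively split on ., ! or ? followed by whitespace; normalize case."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip().lower() for p in parts if p.strip()]


def repeated_sentences(text: str, min_count: int = 2) -> dict[str, int]:
    """Return each sentence that occurs at least min_count times, with its count."""
    counts = Counter(split_sentences(text))
    return {s: n for s, n in counts.items() if n >= min_count}


def repetition_ratio(text: str) -> float:
    """Fraction of sentences that duplicate an earlier one (0.0 = no repeats)."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return 1 - len(set(sentences)) / len(sentences)


# Made-up sample, only to show the output shape.
sample = "A car hits a truck. A car hits a truck. A car hits a truck. Traffic stops."
print(repeated_sentences(sample))  # {'a car hits a truck.': 3}
print(repetition_ratio(sample))    # 0.5
```

A ratio close to 1.0 for one chunk and close to 0.0 for the others would be consistent with a generation loop on that chunk.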

Per-Chunk Inference Log

Chunk 1: VLM inference time = 6.69 seconds

Chunk 2: VLM inference time = 5.09 seconds

Chunk 3: VLM inference time = 523.76 seconds

Chunk 4: VLM inference time = 6.52 seconds

Chunk 5: VLM inference time = 6.52 seconds

Chunk 6: VLM inference time = 5.29 seconds

Chunk 7: VLM inference time = 6.43 seconds

Chunk 8: VLM inference time = 4.35 seconds
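To make the outlier explicit, the times above can be compared against their median. This is a small sketch using only the numbers from the log; `flag_slow_chunks` is a hypothetical helper and the factor of 10 is an arbitrary threshold I picked for illustration.

```python
from statistics import median

# Per-chunk inference times in seconds, copied from the log above.
times = [6.69, 5.09, 523.76, 6.52, 6.52, 5.29, 6.43, 4.35]


def flag_slow_chunks(times: list[float], factor: float = 10.0) -> list[int]:
    """Return 1-based chunk numbers whose time exceeds factor * median."""
    baseline = median(times)
    return [i for i, t in enumerate(times, start=1) if t > factor * baseline]


print(flag_slow_chunks(times))  # [3]
```

The median is used instead of the mean because the single 523.76 second value would otherwise dominate the baseline.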

Please see the attached summary.txt, which contains the responses for all 8 chunks.

How can this issue be solved, and why is it occurring?
