A2V output has near-zero cosine similarity with visual encoder features #5

@venkateshmeeami-sudo

Hi, thanks for releasing the code and checkpoints.

I am trying to understand the behaviour of the A2V block and wanted to verify whether I am using it correctly.

What I did

I took a video and extracted its audio track, so the audio and video inputs come from the same clip.

Then I:

  • Passed the audio through the audio_encoder
  • Passed the video through the visual_encoder
  • Fed the audio_encoder outputs into the A2V block to generate visual representations from audio
  • Computed cosine similarity between:
      • the visual representations generated by the A2V block, and
      • the outputs of the visual_encoder
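
The comparison step above can be sketched as follows. This is a minimal illustration, not the repository's API: `a2v_out` and `vis_out` are hypothetical placeholder lists standing in for the real A2V and visual_encoder outputs, and mean pooling over frames is an assumption made so that sequences of different lengths can be compared as single vectors. With real tensors, `torch.nn.functional.cosine_similarity` computes the same quantity.

```python
import math


def mean_pool(frames):
    """Average a list of per-frame feature vectors into one vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors
    return dot / max(norm_u * norm_v, eps)


# Placeholder features (assumption): 2 "frames" from A2V, 3 from the visual encoder
a2v_out = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
vis_out = [[2.0, 0.0, 2.0], [2.0, 0.0, 2.0], [2.0, 0.0, 2.0]]

# Pool each sequence to a single vector, then compare
sim = cosine_similarity(mean_pool(a2v_out), mean_pool(vis_out))
print(f"cosine similarity: {sim:.4f}")
```

Whether pooling is appropriate here, and over which axis, depends on the actual output shapes of the two modules, which is part of what I am asking below.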

Since both correspond to the same video/audio pair, I expected some meaningful similarity.

Observation

However, I consistently obtain cosine similarity values close to zero.

I tested this using both:

  • stage-1.pth
  • stage-3.pth

and observed similar behaviour in both cases (cosine similarity remains near zero).

Question

Am I misunderstanding the intended usage of the A2V block, or is additional processing/projection/normalization required before comparing these representations?

For example:

  • Should features be taken before/after a specific layer?
  • Should pooling or normalization be applied?
  • Is the A2V output not expected to align directly with visual_encoder outputs?
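
For context on what "close to zero" means: independent random vectors in a high-dimensional space have cosine similarity concentrated around zero, with a spread of roughly 1/sqrt(D). The sketch below estimates that baseline using synthetic Gaussian vectors only. It does not use any model outputs, and `dim = 1024` and the number of trials are arbitrary assumptions rather than the model's actual feature size.

```python
import math
import random


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def random_vector(dim):
    """Synthetic feature vector with i.i.d. standard normal entries."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


random.seed(0)
dim = 1024      # assumed feature dimension, for illustration only
trials = 200    # number of unrelated pairs to sample

sims = [cosine_similarity(random_vector(dim), random_vector(dim))
        for _ in range(trials)]
mean_sim = sum(sims) / trials
std_sim = math.sqrt(sum((s - mean_sim) ** 2 for s in sims) / trials)

print(f"mean: {mean_sim:.4f}, std: {std_sim:.4f}, "
      f"1/sqrt(D): {1 / math.sqrt(dim):.4f}")
```

If matched audio/video pairs score no higher than this kind of unrelated-pair baseline (or than mismatched pairs from different clips), that would suggest the A2V output and the visual_encoder output are not directly comparable without some additional projection.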

Any clarification on the expected behaviour would be very helpful.

Thanks!
