A2V output has near-zero cosine similarity with visual encoder features

Hi, thanks for releasing the code and checkpoints.

I am trying to understand the behaviour of the A2V block and wanted to verify whether I am using it correctly.

What I did

I took a video and extracted its audio track.

Then I:

  - Passed the audio through the audio_encoder
  - Passed the video through the visual_encoder
  - Fed the audio_encoder outputs into the A2V block to generate visual representations from audio
  - Computed cosine similarity between:
      - the visual representations generated by the A2V block, and
      - the outputs of the visual_encoder

Since both correspond to the same video/audio pair, I expected some meaningful similarity.
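For reference, a minimal sketch of this kind of comparison. It is a dependency-free illustration, not the repository's code: `mean_pool`, `cosine_similarity`, and the placeholder values are hypothetical, and it assumes each module returns a (num_tokens, dim) sequence with the same dim. With PyTorch tensors, `torch.nn.functional.cosine_similarity` could replace the helper.

```python
import math


def mean_pool(tokens):
    """Average a (num_tokens, dim) list of feature vectors into one dim-sized vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_a * norm_b, eps)


# Placeholder values standing in for real model outputs (hypothetical).
a2v_tokens = [[0.2, -0.1, 0.4], [0.0, 0.3, 0.5]]                       # A2V block output
visual_tokens = [[0.1, 0.0, 0.3], [0.1, 0.2, 0.6], [0.0, 0.1, 0.2]]    # visual_encoder output

# Token counts may differ, so pool each sequence to a single vector first.
score = cosine_similarity(mean_pool(a2v_tokens), mean_pool(visual_tokens))
print(f"pooled cosine similarity: {score:.3f}")
```

Pooling before comparing is only one possible choice; it is used here because the two sequences may not have the same number of tokens.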

Observation

However, I consistently obtain cosine similarity values close to zero.

I tested this using both:

-   stage-1.pth
-   stage-3.pth

and observed similar behaviour in both cases (cosine similarity remains near zero).

Question

Am I misunderstanding the intended usage of the A2V block, or is additional processing/projection/normalization required before comparing these representations?

For example:

-   Should features be taken before/after a specific layer?
-   Should pooling or normalization be applied?
-   Is the A2V output not expected to align directly with visual_encoder outputs?
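To make these options concrete, here is a rough sketch of what such variants could look like. All names and values are hypothetical and nothing here is confirmed by the repository. It assumes both feature sets share the same dimension and, for the per-token variant, the same number of tokens. One thing that holds regardless: cosine similarity is scale-invariant, so L2-normalizing each vector leaves the score unchanged, whereas subtracting a mean vector can change it.

```python
import math


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / max(norm, eps)


def l2_normalize(v, eps=1e-8):
    """Scale a vector to unit length."""
    n = max(math.sqrt(sum(x * x for x in v)), eps)
    return [x / n for x in v]


def center(v, mean):
    """Subtract a per-dimension mean (e.g. estimated over many clips)."""
    return [x - m for x, m in zip(v, mean)]


def per_token_cosine(a_tokens, b_tokens):
    """Token-by-token similarity; requires aligned sequences of equal length."""
    if len(a_tokens) != len(b_tokens):
        raise ValueError("token counts differ; pool or resample first")
    return [cosine(a, b) for a, b in zip(a_tokens, b_tokens)]


# Placeholder vectors standing in for pooled features (hypothetical).
a = [0.2, -0.1, 0.4]
b = [0.1, 0.0, 0.3]
feature_mean = [0.05, 0.0, 0.2]  # hypothetical mean over a set of clips

raw = cosine(a, b)
normalized = cosine(l2_normalize(a), l2_normalize(b))            # same value as raw
centered = cosine(center(a, feature_mean), center(b, feature_mean))
print(raw, normalized, centered)
```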

Any clarification on the expected behaviour would be very helpful.

Thanks!
