Hi, thanks for releasing the code and checkpoints.
I am trying to understand the behaviour of the A2V block and wanted to verify whether I am using it correctly.
What I did
I took a video and extracted its audio track.
Then I:
- Passed the audio through the audio_encoder
- Passed the video through the visual_encoder
- Fed the audio_encoder outputs into the A2V block to generate visual representations from audio
- Computed cosine similarity between:
  - the visual representations generated by the A2V block, and
  - the outputs of the visual_encoder
Since both correspond to the same video/audio pair, I expected some meaningful similarity.
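To make the comparison concrete, here is a minimal sketch of the kind of check described above. It is not the repository's code: the (T, D) feature shape, the mean pooling over time, and the helper name `pooled_cosine` are assumptions for illustration, and the random arrays are placeholders standing in for the real A2V block and visual_encoder outputs.

```python
import numpy as np

def pooled_cosine(a2v_out: np.ndarray, visual_out: np.ndarray) -> float:
    """Mean-pool two (T, D) feature arrays over time, L2-normalize, return cosine similarity."""
    a = a2v_out.mean(axis=0)      # (D,) pooled A2V features
    v = visual_out.mean(axis=0)   # (D,) pooled visual features
    a = a / (np.linalg.norm(a) + 1e-8)
    v = v / (np.linalg.norm(v) + 1e-8)
    return float(a @ v)

# Placeholder inputs (assumed shapes); replace with the real model outputs.
rng = np.random.default_rng(0)
a2v_out = rng.standard_normal((16, 512))
visual_out = rng.standard_normal((16, 512))

print(pooled_cosine(a2v_out, visual_out))
```

Note that two unrelated random vectors in a high-dimensional space also give a cosine similarity near zero, so a near-zero value on real features would be consistent with the two outputs living in spaces that are not directly aligned.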
Observation
However, I consistently obtain cosine similarity values close to zero.
I tested this using both:
and observed similar behaviour in both cases (cosine similarity remains near zero).
Question
Am I misunderstanding the intended usage of the A2V block, or is additional processing/projection/normalization required before comparing these representations?
For example:
- Should features be taken before/after a specific layer?
- Should pooling or normalization be applied?
- Is the A2V output not expected to align directly with visual_encoder outputs?
Any clarification on the expected behaviour would be very helpful.
Thanks!