This work addresses the problem of grounding phrases in visual data by exploiting the rich semantic information of a joint embedding space learned from multi-modal data. We build a deep neural network that extracts associative co-localization patterns in a self-supervised fashion; these patterns emerge as a by-product of training the network with a ranking objective for bidirectional image-caption retrieval. The resulting co-localization maps can be viewed as rich pixel-wise saliency maps that develop naturally in the common multi-modal embedding space, with the added advantage of requiring no extra parameters.
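To make the training signal concrete, the bidirectional retrieval objective described above is typically a max-margin ranking loss over an image-caption similarity matrix. The sketch below is a minimal NumPy illustration of that general idea, not the exact loss used in this project; the function name, margin value, and the assumption that matched pairs share the same row index are all choices made for this example.

```python
import numpy as np

def bidirectional_ranking_loss(img_emb, cap_emb, margin=0.2):
    """Max-margin ranking loss for bidirectional image-caption retrieval.

    img_emb, cap_emb: (N, D) L2-normalized embeddings, where the
    matched image-caption pair shares the same row index (an
    assumption of this sketch).
    """
    sims = img_emb @ cap_emb.T           # (N, N) cosine similarity matrix
    pos = np.diag(sims)                  # similarities of matched pairs
    # Caption retrieval: each image should rank its own caption above others.
    cost_cap = np.maximum(0.0, margin - pos[:, None] + sims)
    # Image retrieval: each caption should rank its own image above others.
    cost_img = np.maximum(0.0, margin - pos[None, :] + sims)
    n = sims.shape[0]
    mask = ~np.eye(n, dtype=bool)        # exclude the positive pair itself
    return (cost_cap[mask].sum() + cost_img[mask].sum()) / n
```

With perfectly aligned embeddings the matched-pair similarity dominates every mismatched one by at least the margin, so the loss is zero; mismatched pairings incur a positive penalty, which is the signal that shapes the shared embedding space.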
The link to the GitHub repository can be found here.
Collaborators on this project:
- Shashank Verma
- Noor Mohamed Ghouse