Loc-Net: Co-localizing Text and Image Modalities

This work addresses the problem of grounding phrases in visual data in a simple fashion by exploiting the rich semantic information available in a joint embedding space learnt from multi-modal data. We build a deep neural network architecture that extracts associative co-localization patterns in a self-supervised fashion. These patterns emerge naturally from training the network on a ranking objective for bidirectional image-caption retrieval. The resulting co-localization maps can be viewed as rich pixel-wise saliency maps that develop in the common multi-modal space, with the added advantage of requiring no extra parameters.
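As a rough illustration of the two ideas above, the sketch below shows (a) a standard hinge-based bidirectional ranking loss over a batch of matched image/caption embeddings, and (b) a word-to-pixel co-localization map computed as similarity in the joint space. This is a minimal PyTorch sketch under assumed shapes and names (`bidirectional_ranking_loss`, `colocalization_map`, the margin value, and the `[B, D]` / `[D, H, W]` tensor layouts are all hypothetical), not the project's exact implementation.

```python
import torch
import torch.nn.functional as F

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based bidirectional ranking loss for image-caption retrieval.
    img_emb, txt_emb: [B, D] embeddings of matched image/caption pairs
    (hypothetical shapes; row i of each tensor is a matched pair)."""
    # Cosine similarity matrix between every image and every caption.
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    scores = img_emb @ txt_emb.t()            # [B, B]
    diag = scores.diag().view(-1, 1)          # scores of matched pairs

    # Caption retrieval given an image (rows) and image retrieval
    # given a caption (columns), each penalized against the margin.
    cost_cap = (margin + scores - diag).clamp(min=0)
    cost_img = (margin + scores - diag.t()).clamp(min=0)

    # Do not penalize the matched pairs themselves.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    return cost_cap.sum() + cost_img.sum()


def colocalization_map(spatial_feats, word_emb):
    """Pixel-wise saliency for one word/phrase: cosine similarity between
    a word embedding [D] and projected spatial image features [D, H, W],
    computed directly in the shared embedding space (no extra parameters)."""
    feats = F.normalize(spatial_feats, dim=0)
    word = F.normalize(word_emb, dim=0)
    return torch.einsum('dhw,d->hw', feats, word)   # [H, W] heat map
```

Because the heat map is read straight off the shared embedding space, no localization-specific layers or parameters are added on top of the retrieval model.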

The link to the GitHub repository can be found here.

Collaborators on this project:

  • Shashank Verma
  • Noor Mohamed Ghouse