Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval

Jie Luo, Yiyu Wang, Mengshi Qi
2021
Journal paper

Abstract

With the current exponential growth of video-based social networks, video retrieval using natural language is receiving ever-increasing attention. Most existing approaches tackle this task by extracting individual frame-level spatial features to represent the whole video, while ignoring visual pattern consistencies and intrinsic temporal relationships across different frames. Furthermore, the semantic correspondence between natural language queries and person-centric actions in videos has not been fully explored. To address these problems, we propose a novel binary representation learning framework, named Semantics-aware Spatial-temporal Binaries (S²Bin), which simultaneously considers spatial-temporal context and semantic relationships for cross-modal video retrieval. By exploiting the semantic relationships between the two modalities, S²Bin can efficiently and effectively generate binary codes for both videos and texts. In addition, we adopt an iterative optimization scheme to learn deep encoding functions with attribute-guided stochastic training. We evaluate our model on three video datasets, and the experimental results demonstrate that S²Bin outperforms state-of-the-art methods on various cross-modal video retrieval tasks.
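To illustrate why binary codes make cross-modal retrieval efficient, the following is a minimal sketch of Hamming-distance ranking between a text query and a set of videos. It is not the S²Bin encoder: the sign-based `binarize` function, the helper names, and the 4-dimensional feature vectors are all hypothetical stand-ins for the learned deep encoding functions described in the abstract.

```python
from typing import List, Sequence


def binarize(features: Sequence[float]) -> int:
    """Pack the signs of a real-valued vector into an integer bit code.

    Assumption: simple sign thresholding stands in for a learned encoder.
    """
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Count the bit positions where two codes differ."""
    return bin(a ^ b).count("1")


def rank_videos(query_code: int, video_codes: List[int]) -> List[int]:
    """Return video indices sorted by Hamming distance to the query code."""
    return sorted(
        range(len(video_codes)),
        key=lambda i: hamming(query_code, video_codes[i]),
    )


# Hypothetical embeddings, made up purely for illustration.
video_features = [
    [0.9, -0.2, 0.4, -0.7],
    [-0.5, 0.3, -0.1, 0.8],
    [0.7, -0.6, -0.3, -0.2],
]
text_query = [0.8, -0.1, 0.5, -0.9]

video_codes = [binarize(f) for f in video_features]
query_code = binarize(text_query)
ranking = rank_videos(query_code, video_codes)
print(ranking)
```

Because both modalities are mapped into the same binary space, comparing a query against a video reduces to an XOR and a bit count, which is the usual motivation for hashing-based retrieval.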

Official source

https://infoscience.epfl.ch/record/284443?ln=en

