The field of text-to-video retrieval has advanced significantly with the evolution of language models and large-scale pre-training on generated caption-video pairs. Current methods focus predominantly on visual and event-based details, making retrieval largely reliant on tangible aspects of a video. However, videos encompass more than just "seen" or "heard" elements, containing diverse, nuanced layers that are often overlooked. This work addresses that gap by introducing a method that incorporates audio, style, and emotion into text-to-video retrieval through three key components. First, an augmentation block generates additional textual information on a video's audio, style, and emotional aspects, supplementing the original caption. Second, a cross-modal audiovisual attention block fuses the visual and audio streams of the video and aligns the result with this enriched textual information. Third, hybrid space learning uses multiple latent spaces to align textual and video data, minimizing potential conflicts between the various information sources. In standard evaluations, models are typically tested on benchmark datasets that emphasize short, simple, visual and event-based queries. To more accurately assess performance under diverse query conditions that capture the nuanced dimensions of video content, we developed a new evaluation dataset. Our results demonstrate that our method performs comparably with state-of-the-art models on conventional test sets, while surpassing non-pre-trained models on more complex queries, as evidenced by this new test dataset.
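To make the second component more concrete, the sketch below shows one plausible way a cross-modal audiovisual attention block could fuse frame-level visual features with audio features via bidirectional cross-attention. This is a minimal illustration under our own assumptions; the class name, dimensions, pooling, and residual/normalization choices are hypothetical and not taken from the paper's implementation.

```python
import torch
import torch.nn as nn


class CrossModalAudioVisualFusion(nn.Module):
    """Illustrative cross-modal attention block (hypothetical): each modality
    attends to the other, and the two streams are pooled and averaged into a
    single fused video embedding."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Visual tokens attend to audio tokens, and vice versa.
        self.visual_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.audio_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (batch, n_frames, dim), audio: (batch, n_audio_tokens, dim)
        v_attended, _ = self.visual_to_audio(query=visual, key=audio, value=audio)
        a_attended, _ = self.audio_to_visual(query=audio, key=visual, value=visual)
        # Residual connection and normalization, then mean-pool each stream.
        v = self.norm(visual + v_attended).mean(dim=1)
        a = self.norm(audio + a_attended).mean(dim=1)
        # Average the two pooled streams into one (batch, dim) video embedding.
        return (v + a) / 2
```

In a retrieval setup of this kind, the fused embedding would then be projected into the shared (or hybrid) latent spaces and compared against the text embeddings of the original and augmented captions.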