Measuring animal behavior is a crucial task across a range of scientific applications. Modern approaches built on deep neural networks call for solutions that span a wide range of vision and language tasks. This ubiquitous need to understand behavior, in both scientific and industrial settings, together with the difficulty of modeling behavior occurring in the wild, motivates building more generalizable and robust deep neural networks. This dissertation therefore focuses on the following key problems: How do we train deep neural networks to generalize across different data domains? How do we train neural networks, or neural network-based systems, to perform a wide range of vision and language tasks? How do we enable models to adapt and learn at inference time? More specifically, can we give models some level of adaptive intelligence, either by learning from interactions with users and data, or via dynamic computation such as in-context learning? Towards this goal, we draw inspiration both from progress in artificial intelligence and from insights in neuroscience.

In Chapter 2, we introduce methods for merging heterogeneous animal pose datasets, along with training algorithms that counter domain shift and catastrophic forgetting. We show that these methods give rise to the first animal pose foundation model whose zero-shot performance on downstream tasks is comparable to that of a fully trained model. In Chapter 3, we report the first keypoint-aware, unsupervised learning approach with transformers for animal re-identification. To explore the growing utility of large language models (LLMs), in Chapter 4 we propose the first LLM-based agentic system that uses pre-trained deep neural networks and a Python API as tools and that dynamically learns and adapts at inference time. Built around principles inspired by neuroscience, this system pushes the frontier of using LLM-based agents to automate behavior analysis with state-of-the-art computer vision and language models. Finally, in Chapter 5, we propose novel methods to evaluate and improve multi-modal large language models (MLLMs) on the challenging task of recognizing human actions that occur in daily life.