This project examines how agents align visual observations with language over time, especially when scenes, objects, and goals evolve beyond closed-set assumptions.
We are interested in open-world recognition, grounded language understanding, and long-horizon video reasoning for systems that operate continuously rather than on isolated clips.
This page is a placeholder for future project details, papers, demos, and datasets.