Publications

You can also find my articles on my Google Scholar profile.

Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval

Published in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

This work introduces T-MASS, which models text as a stochastic embedding to facilitate joint learning of the text mass and video points.

Recommended citation: Wang, J., Sun, G., Wang, P., Liu, D., Dianat, S.A., Rabbani, M., Rao, R.M., & Tao, Z. (2024). Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval. CVPR
Download Paper

Prototypical Transformer as Unified Motion Learners

Published in Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

This work refines feature representations through prototype-feature association.

Recommended citation: Han, C., Lu, Y., Sun, G., Liang, J., Cao, Z., Wang, Q., Guan, Q., Dianat, S.A., Rao, R.M., Geng, T., Tao, Z., & Liu, D. (2024). Prototypical Transformer as Unified Motion Learners. ICML
Download Paper

Aligning Out-of-Distribution Web Images and Caption Semantics via Evidential Learning

Published in Proceedings of the ACM on Web Conference (WWW), 2024

This work efficiently improves the robustness and performance of pre-trained vision-language networks on in-distribution (ID) and out-of-distribution (OOD) cases in image-text retrieval, using evidential learning.

Recommended citation: Sun, G., Bai, Y., Yang, X., Fang, Y., Fu, Y., & Tao, Z. (2024). Aligning Out-of-Distribution Web Images and Caption Semantics via Evidential Learning. WWW
Download Paper

SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant

Published in Proceedings of the 18th European Conference on Computer Vision (ECCV), 2024

This work introduces a training method that uses visual self-questioning to enhance general-purpose vision-language understanding and image-oriented question answering.

Recommended citation: Sun, G., Qin, C., Wang, J., Chen, Z., Xu, R., & Tao, Z. (2024). SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. ECCV
Download Paper