I am a Research Scientist and Manager at Apple AI/ML, primarily working on building large-scale vision and multimodal foundation models. Before joining Apple, I was a Principal Researcher at Microsoft Azure AI, working on Project Florence-VL. I received my Ph.D. from Duke University in Spring 2018, and my Master's and B.Sc. degrees from Peking University in 2013 and 2010, respectively. My Ph.D. advisor was Lawrence Carin. I can be reached at pkuganzhe@gmail.com and zhe.gan@apple.com.
I am serving (or have served) as a Senior Area Chair (SAC) for ACL 2025 and EMNLP 2024; as an Area Chair for NeurIPS 2019-2024, ICML 2021-2025, ICLR 2021-2025, CVPR 2023-2025, ICCV 2025, ECCV 2022, WACV 2024, ACL 2021-2024, NAACL 2024/2022, EMNLP 2023/2022, COLM 2024, and AAAI 2023/2022; and as a Senior Program Committee (SPC) member for AAAI 2021/2020, for which I received the AAAI-20 Outstanding SPC Award. Together with my co-authors, I have also been honored with Best Student Paper Honorable Mention Awards at CVPR 2021 and WACV 2021.
Research Highlights:
- [2024/12] Please check out our new papers: (1) AIMv2, an exploration of multimodal autoregressive pre-training of large vision encoders; and (2) GEA, an empirical analysis of both data and method choices for agents across robotics, planning, UI interactions, and video games.
- [2024/10] Please check out our new papers: (1) Ferret-UI 2, an upgrade of Ferret-UI for universal UI understanding; (2) MM-Ego, an exploration of egocentric multimodal LLMs; and (3) chain-of-thought reasoning for multimodal LLMs (paper).
- [2024/10] Please check out our new papers: (1) MM1.5, a significant upgrade of MM1; (2) CLOC, our next-generation image encoder for multimodal LLMs; (3) VeCapV2, an upgrade of VeCap and a comprehensive study of synthetic image captions across CLIP, multimodal LLMs, and diffusion models; (4) SlowFast-LLaVA, an exploration of the slow-fast idea for video LLMs; and (5) a comprehensive study of alignment in multimodal LLMs (paper).
- [2024/7] 4 papers accepted to ECCV 2024: MM1, Ferret-UI, VeCLIP, and GRiT. Ferret-v2 was accepted to COLM 2024. Also, please check out my talk on MM1 at the CVPR 2024 tutorial and workshop (slides).
- [2024/3] Please check out our new paper MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training.
- [2024/2] 4 papers accepted to ICLR 2024: (1) Ferret: a new multimodal LLM that can refer and ground anything anywhere at any granularity (here), (2) MGIE: multimodal LLM for guiding instruction-based image editing (here), (3) compressing LLMs: the truth is rarely pure and never simple (here), and (4) MOFI: learning image representations from noisy entity annotated images (here).
- [2023/9] Please check out our survey paper/book on Multimodal Foundation Models: From Specialists to General-Purpose Assistants.
- [2023/6] We held a tutorial on Recent Advances in Vision Foundation Models at CVPR 2023. All the slides can now be downloaded from the tutorial webpage.
- [2023/6] MOFI is our new vision foundation model that is designed to learn image representations from noisy entity annotated images. To achieve this, we have created Image-to-Entities (I2E), a new large-scale dataset with 1 billion images and 2 million distinct entities, covering rich visual concepts in the wild.
- [2023/2] 5 papers accepted to CVPR 2023: (1) X-Decoder: generalist modeling for (open-vocab) segmentation and vision-language tasks; (2) ReCo: region-controlled text-to-image generation; (3) xCLIP: enhancing CLIP with non-contrastive learning; and (4)-(5) LAVENDER and VIOLET-v2, two empirical studies on video-language pre-training.
- [2023/1] Our recent work on Prompting GPT-3 To Be Reliable was accepted to ICLR 2023.
- [2023/1] Gave an invited talk on multimodal foundation models at WACV 2023 [slides].
- [2022/12] Please check out our new survey paper/book on Vision-Language Pre-training: Basics, Recent Advances, and Future Trends, published at Foundations and Trends in Computer Graphics and Vision.
© December 2024 Zhe Gan