I am a Research Scientist and Manager at Apple AI/ML, primarily working on building large-scale vision and multimodal foundation models. Before joining Apple, I was a Principal Researcher at Microsoft Azure AI, working on Project Florence-VL. I received my Ph.D. from Duke University in Spring 2018, and my Master's and B.Sc. degrees from Peking University in 2013 and 2010, respectively. My Ph.D. advisor was Lawrence Carin. I can be reached at pkuganzhe@gmail.com and zhe.gan@apple.com.
I am serving (or have served) as a Senior Area Chair (SAC) for ACL 2025 and EMNLP 2024; as an Area Chair for NeurIPS 2019-2024, ICML 2021-2025, ICLR 2021-2025, CVPR 2023-2025, ICCV 2025, ECCV 2022, WACV 2024, ACL 2021-2024, NAACL 2024/2022, EMNLP 2023/2022, COLM 2024, and AAAI 2023/2022; and as a Senior Program Committee (SPC) member for AAAI 2021/2020, for which I received the AAAI-20 Outstanding SPC Award. Together with my co-authors, I have also been honored with Best Student Paper Honorable Mention Awards at CVPR 2021 and WACV 2021.
Research Highlights:
- [2024/12] Please check out our new papers: (1) AIMv2, an exploration of multimodal autoregressive pre-training of large vision encoders; and (2) GEA, an empirical analysis of both data and method choices for agents across robotics, planning, UI interactions, and video games.
- [2024/10] Please check out our new papers: (1) Ferret-UI 2, an upgrade of Ferret-UI for universal UI understanding; (2) MM-Ego, an exploration of egocentric multimodal LLMs; and (3) chain-of-thought reasoning for multimodal LLMs (paper).
- [2024/10] Please check out our new papers: (1) MM1.5, a significant upgrade of MM1; (2) CLOC, our next-generation image encoder for multimodal LLMs; (3) VeCapV2, an upgrade of VeCap and a comprehensive study of synthetic image captions across CLIP, multimodal LLMs, and diffusion models; (4) SlowFast-LLaVA, an exploration of the slow-fast idea for video LLMs; and (5) a comprehensive study of alignment in multimodal LLMs (paper).
- [2024/7] 4 papers accepted to ECCV 2024: MM1, Ferret-UI, VeCLIP, and GRiT. Ferret-v2 was accepted to COLM 2024. Also, please check out my talk on MM1 at the CVPR 2024 tutorial and workshop (slides).
- [2024/3] Please check out our new paper MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training.
- [2024/2] 4 papers accepted to ICLR 2024: (1) Ferret: a new multimodal LLM that can refer and ground anything anywhere at any granularity (here), (2) MGIE: multimodal LLM for guiding instruction-based image editing (here), (3) compressing LLMs: the truth is rarely pure and never simple (here), and (4) MOFI: learning image representations from noisy entity annotated images (here).
- [2023/9] Please check out our survey paper/book on Multimodal Foundation Models: From Specialists to General-Purpose Assistants.
- [2023/6] We held a tutorial on Recent Advances in Vision Foundation Models at CVPR 2023. All the slides can now be downloaded from the tutorial webpage.
- [2023/6] MOFI is our new vision foundation model that is designed to learn image representations from noisy entity annotated images. To achieve this, we have created Image-to-Entities (I2E), a new large-scale dataset with 1 billion images and 2 million distinct entities, covering rich visual concepts in the wild.
- [2023/2] 5 papers accepted to CVPR 2023: (1) X-Decoder: generalist modeling for (open-vocab) segmentation and vision-language tasks; (2) ReCo: region-controlled text-to-image generation; (3) xCLIP: enhancing CLIP with non-contrastive learning; and (4)-(5) LAVENDER and VIOLET-v2, two empirical studies on video-language pre-training.
- [2023/1] Our recent work on Prompting GPT-3 To Be Reliable was accepted to ICLR 2023.
- [2023/1] Gave an invited talk on multimodal foundation models at WACV 2023 [slides].
- [2022/12] Please check out our new survey paper/book on Vision-Language Pre-training: Basics, Recent Advances, and Future Trends, published at Foundations and Trends in Computer Graphics and Vision.
© December 2024 Zhe Gan