I am a Research Scientist and Manager at Apple AI/ML, primarily working on building large-scale vision and multimodal foundation models. Before joining Apple, I was a Principal Researcher at Microsoft Azure AI, working on Project Florence-VL. I received my Ph.D. degree from Duke University in Spring 2018, and my Master's and B.Sc. degrees from Peking University in 2013 and 2010, respectively. My Ph.D. advisor was Lawrence Carin. I can be reached at pkuganzhe@gmail.com and zhe.gan@apple.com.
I am serving (or have served) as a Senior Area Chair (SAC) for ACL 2025 and EMNLP 2024; an Area Chair for NeurIPS 2024/2023/2022/2021/2020/2019, ICML 2024/2023/2022/2021, ICLR 2024/2023/2021, CVPR 2024/2023, ECCV 2022, WACV 2024, ACL 2024/2022/2021, NAACL 2024/2022, EMNLP 2023/2022, COLM 2024, and AAAI 2023/2022; and a Senior Program Committee (SPC) member for AAAI 2021/2020, for which I received the AAAI-20 Outstanding SPC Award. Together with my co-authors, I have also been honored with Best Student Paper Honorable Mention Awards at CVPR 2021 and WACV 2021.
Research Highlights:
- [2024/7] 4 papers accepted to ECCV 2024: MM1, Ferret-UI, VeCLIP, and GRiT. Ferret-v2 was accepted to COLM 2024. Also, please check out my talk on MM1 at the CVPR 2024 tutorial and workshop (slides).
- [2024/3] Please check out our new paper MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training.
- [2024/2] 4 papers accepted to ICLR 2024: (1) Ferret: a new multimodal LLM that can refer and ground anything anywhere at any granularity (here), (2) MGIE: multimodal LLM for guiding instruction-based image editing (here), (3) compressing LLMs: the truth is rarely pure and never simple (here), and (4) MOFI: learning image representations from noisy entity annotated images (here).
- [2023/9] Please check out our survey paper/book on Multimodal Foundation Models: From Specialists to General-Purpose Assistants.
- [2023/6] We held a tutorial on Recent Advances in Vision Foundation Models at CVPR 2023. All the slides can now be downloaded from the tutorial webpage.
- [2023/6] MOFI is our new vision foundation model that is designed to learn image representations from noisy entity annotated images. To achieve this, we have created Image-to-Entities (I2E), a new large-scale dataset with 1 billion images and 2 million distinct entities, covering rich visual concepts in the wild.
- [2023/2] 5 papers accepted to CVPR 2023: (1) X-decoder: generalist modeling for (open-vocab) segmentation and vision-language tasks; (2) ReCo: region-controlled text-to-image generation; (3) x-CLIP: enhancing CLIP with non-contrastive learning; (4) LAVENDER and VIOLET-v2: two empirical studies on video-language pre-training.
- [2023/1] Our recent work on Prompting GPT-3 To Be Reliable was accepted to ICLR 2023.
- [2023/1] Gave an invited talk on multimodal foundation models at WACV 2023 [slides].
- [2022/12] Please check out our new survey paper/book on Vision-Language Pre-training: Basics, Recent Advances, and Future Trends, published in Foundations and Trends in Computer Graphics and Vision.
- [2022/9] 3 papers accepted to NeurIPS 2022: (1) NUWA Infinity, our new multimodal generative model for image synthesis; (2) FIBER, our new VLP model that provides a unified solution for both VL understanding and localization tasks; and (3) K-Lite, which explores how to enhance UniCL and GLIP models with external knowledge.
- [2022/9] UniTAB was accepted as an Oral paper at ECCV 2022. Following lines of work such as Pix2seq, Pix2seqV2, OFA, and Unified-IO, we propose a simple unified seq2seq learning framework that can output sequences with mixed text and box tokens.
- [2022/7] NUWA Infinity is our new multimodal generative model that can generate high-quality images and videos from a given text or image input. It can generate images at resolutions up to 38912 × 2048 pixels. Check out our project website here.
- [2022/6] We held a tutorial on recent advances in vision-language pre-training at CVPR 2022. All our slides are now available on our tutorial website.
- [2022/6] Florence-GIT is our new multimodal generative foundation model, for which we trained a simple image-to-text transformer on 800M image-text pairs. GIT achieves a new state of the art across 12 image/video captioning and QA tasks, including the first human-parity result on TextCaps. GIT also achieves an accuracy of 88.79% on ImageNet-1k using a generative scheme, and it can recognize logos, landmarks, characters, etc.
© July 2024 Zhe Gan