I am a Staff Research Scientist at Apple AI/ML, where I primarily work on building large-scale vision and multimodal foundation models. Before joining Apple, I was a Principal Researcher at Microsoft Azure AI, working on Project Florence-VL. I received my Ph.D. from Duke University in Spring 2018, and my Master's and B.Sc. degrees from Peking University in 2013 and 2010, respectively. My Ph.D. advisor was Lawrence Carin. I can be reached at pkuganzhe@gmail.com and zhe.gan@apple.com.
I am serving (or have served) as an Area Chair for NeurIPS 2023/2022/2021/2020/2019, ICML 2023/2022/2021, ICLR 2024/2023/2021, CVPR 2024/2023, ECCV 2022, ACL 2022/2021, NAACL 2022, EMNLP 2023/2022, and AAAI 2023/2022, and as a Senior Program Committee (SPC) member for AAAI 2021/2020, receiving the AAAI-20 Outstanding SPC Award. Together with my co-authors, I have also received Best Student Paper Honorable Mention Awards at CVPR 2021 and WACV 2021.
Research Highlights:
- [2023/9] Please check out our survey paper/book on Multimodal Foundation Models: From Specialists to General-Purpose Assistants.
- [2023/6] We held a tutorial on Recent Advances in Vision Foundation Models at CVPR 2023. All the slides can now be downloaded from the tutorial webpage.
- [2023/6] MOFI is our new vision foundation model designed to learn image representations from noisy entity-annotated images. To achieve this, we created Image-to-Entities (I2E), a new large-scale dataset with 1 billion images and 2 million distinct entities, covering rich visual concepts in the wild.
- [2023/2] 5 papers accepted to CVPR 2023: (1) X-Decoder, generalist modeling for (open-vocab) segmentation and vision-language tasks; (2) ReCo, region-controlled text-to-image generation; (3) x-CLIP, enhancing CLIP with non-contrastive learning; and (4) LAVENDER and (5) VIOLET-v2, two empirical studies on video-language pre-training.
- [2023/1] Our recent work on Prompting GPT-3 To Be Reliable was accepted to ICLR 2023.
- [2023/1] Gave an invited talk on multimodal foundation models at WACV 2023 [slides].
- [2022/12] Please check out our new survey paper/book on Vision-Language Pre-training: Basics, Recent Advances, and Future Trends, published at Foundations and Trends in Computer Graphics and Vision.
- [2022/9] 3 papers accepted to NeurIPS 2022: (1) NUWA Infinity, our new multimodal generative model for image synthesis; (2) FIBER, our new VLP model that provides a unified solution for both VL understanding and localization tasks; and (3) K-Lite, which explores how to enhance UniCL and GLIP models with external knowledge.
- [2022/9] UniTAB was accepted as an oral paper at ECCV 2022. Following lines of work such as Pix2seq, Pix2seqV2, OFA, and Unified-IO, we propose a simple unified seq2seq learning framework that can output sequences with mixed text and box tokens.
- [2022/7] NUWA Infinity is our new multimodal generative model that can generate high-quality images and videos from a given text or image input, producing images at resolutions up to 38912 × 2048 pixels. Check out our project website here.
- [2022/6] We held a tutorial on recent advances in vision-language pre-training at CVPR 2022. All the slides are now available on the tutorial website.
- [2022/6] Florence-GIT is our new multimodal generative foundation model, a simple image-to-text transformer trained on 800M image-text pairs. GIT achieves new sota across 12 image/video captioning and QA tasks, including the first human-parity result on TextCaps; reaches an accuracy of 88.79% on ImageNet-1k using a generative scheme; and can recognize logos, landmarks, characters, etc.
- [2022/3] 4 papers accepted to CVPR 2022, including (i) METER, an end-to-end transformer-based VLP model; (ii) LEMON, which scales up VLP for image captioning; (iii) SwinBERT, a Video-Swin-based video captioning model; and (iv) ViTCAP, a ViT-based image captioning model.
- [2021/10] 3 papers accepted to the main conference track of NeurIPS 2021, including Sparse ViT, Elastic LTH, and GAN lottery tickets; and 2 papers accepted to the datasets and benchmarks track of NeurIPS 2021, including AdvGLUE and VALUE.
- [2021/10] Over the summer, we hosted a special Vision-Language Talk Series. With 11 invited speakers from academia and industry, the series covered diverse topics ranging from image captioning, VQA, multimodal pre-training (ALIGN, MDETR), grounded visual generation, zero-shot object detection (ViLD), and video-language understanding (MERLOT) to self-supervised learning (MoCo-v3). Want to know more? Please check the YouTube playlist and MSR video series.
- [2021/09] We all know GPT-3 is a strong few-shot learner for NLP problems, but can it also benefit multimodal tasks? In this new work, we provide an empirical study of GPT-3 for knowledge-based VQA, and show that prompting GPT-3 with image captions and only 16 in-context examples surpasses the supervised sota by an absolute +8.6 points on the OK-VQA dataset (from 39.4 to 48.0); a minimal sketch of this prompting setup is shown below.
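Below is a minimal Python sketch of what this caption-based prompting looks like in practice. The prompt wording, in-context examples, and helper names are illustrative placeholders of my own rather than the paper's actual data or code; the idea is simply to turn the image into text (its caption), prepend a few solved examples, and let a text-only LLM complete the answer.

```python
# Illustrative sketch of caption-based prompting for knowledge-based VQA.
# The examples below are made-up placeholders, not data from OK-VQA.

# Hypothetical in-context examples: (caption, question, answer) triples.
IN_CONTEXT_EXAMPLES = [
    ("A man riding a surfboard on a large wave.",
     "What sport is shown here?", "surfing"),
    ("A red double-decker bus driving down a city street.",
     "In which country are such buses common?", "england"),
]

def build_prompt(caption: str, question: str) -> str:
    """Render the image as text (via its caption), prepend a few solved
    examples, and leave the final answer slot for the LLM to fill in."""
    header = "Please answer the question according to the context.\n\n"
    shots = "".join(
        f"Context: {c}\nQuestion: {q}\nAnswer: {a}\n\n"
        for c, q, a in IN_CONTEXT_EXAMPLES
    )
    query = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    return header + shots + query

if __name__ == "__main__":
    # The resulting string would be sent to GPT-3 (or any text-only LLM),
    # and the model's completion is taken as the predicted answer.
    prompt = build_prompt(
        caption="A chef slicing raw fish behind a sushi counter.",
        question="What country does this dish come from?",
    )
    print(prompt)
```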
© September 2023 Zhe Gan