I am a Principal Researcher at Microsoft Azure AI, primarily working on building large-scale multimodal foundation models under Project Florence-VL. I received my Ph.D. from Duke University in Spring 2018, advised by Lawrence Carin. Before that, I received my Master's and B.Sc. from Peking University in 2013 and 2010, respectively. I can be reached at zhe.gan@microsoft.com.
I am serving (or have served) as an Area Chair for NeurIPS 2022/2021/2020/2019, ICML 2022/2021, ICLR 2023/2021, ECCV 2022, ACL 2022/2021, NAACL 2022, EMNLP 2022, and AAAI 2023/2022, and as a Senior Program Committee (SPC) member for AAAI 2021/2020, where I received the AAAI-20 Outstanding SPC Award. Together with my co-authors, I have also received Best Student Paper Honorable Mention Awards at CVPR 2021 and WACV 2021.
Research Highlights:
- [2022/7] NUWA-Infinity is our new multimodal generative model that can generate high-quality images and videos from text or image input, including images at resolutions up to 38912 × 2048 pixels. Check out our project website here.
- [2022/6] We held a tutorial on recent advances in vision-language pre-training at CVPR 2022. All slides are now available on our tutorial website.
- [2022/6] Florence-GIT is our new multimodal generative foundation model: a simple image-to-text transformer trained on 800M image-text pairs. GIT achieves new state-of-the-art results across 12 image/video captioning and QA tasks, including the first human-parity result on TextCaps. It also reaches 88.79% accuracy on ImageNet-1k using a generative scheme, and can recognize logos, landmarks, characters, and more.
- [2022/3] 4 papers accepted to CVPR 2022, including (i) METER, an end-to-end transformer-based VLP model; (ii) LEMON, scaling up VLP for image captioning; (iii) SwinBERT, video-Swin-based video captioning model; and (iv) ViTCAP, ViT-based image captioning model.
- [2021/10] 3 papers accepted to the main conference track of NeurIPS 2021, including Sparse ViT, Elastic LTH, and GAN lottery tickets; and 2 papers accepted to the datasets and benchmarks track of NeurIPS 2021, including AdvGLUE and VALUE.
- [2021/10] Over the summer, we hosted a special Vision-Language Talk Series. With 11 invited speakers from academia and industry, we covered diverse topics ranging from image captioning, VQA, multimodal pre-training (ALIGN, MDETR), grounded visual generation, zero-shot object detection (ViLD), and video-language understanding (MERLOT) to self-supervised learning (MoCo-v3). Want to know more? Please check the YouTube playlist and MSR video series.
- [2021/09] We all know GPT-3 is a strong few-shot learner for NLP problems, but can it also benefit multimodal tasks? In this new work, we provide an empirical study of GPT-3 for knowledge-based VQA, and show that prompting GPT-3 with image captions and only 16 in-context examples surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset (from 39.4 to 48.0).
- [2021/07] Our Adversarial VQA work was accepted to ICCV 2021 as an Oral paper (top 3% of all submissions).
- [2021/06] Our ClipBERT paper wins the Best Student Paper Honorable Mention Award at CVPR 2021.
- [2021/06] Four updates on our recent vision-and-language efforts: (i) our CVPR 2021 tutorial will take place on 6/20; (ii) our VALUE benchmark and competition have been launched; (iii) the arXiv version of our Adversarial VQA benchmark has been released; (iv) we won the TextCaps Challenge 2021.
- [2021/05] Two papers accepted by ACL 2021: one long paper in the Main Conference track and one in the Findings track. Topics include (i) EarlyBERT for efficient BERT training, and (ii) Cluster-Former, an efficient transformer variant for question answering.
- [2021/03] Two papers accepted by CVPR 2021, and one paper accepted to NAACL 2021. Topics include (i) ClipBERT for video-and-language learning (Oral with 3 strong accepts), (ii) enhancing contrastive knowledge distillation with Wasserstein learning, and (iii) text generation from hyperbolic space.
- [2021/01] Our Meta Module Network wins the Best Student Paper Honorable Mention Award at WACV 2021.
© July 2022 Zhe Gan