I am a Principal Researcher at Microsoft Azure AI, primarily working on Vision-and-Language Multimodal Intelligence under Project Florence-VL. I received my Ph.D. from Duke University in Spring 2018, advised by Lawrence Carin. Before that, I received my Master's and B.Sc. from Peking University in 2013 and 2010, respectively. I can be reached at zhe.gan@microsoft.com.
I am serving (or have served) as an Area Chair for NeurIPS 2022/2021/2020/2019, ICML 2022/2021, ICLR 2021, ECCV 2022, ACL 2022/2021, NAACL 2022, and AAAI 2022, and as a Senior Program Committee (SPC) member for AAAI 2021/2020, for which I received the AAAI-20 Outstanding SPC Award. Together with my co-authors, I have also received Best Student Paper Honorable Mention Awards at CVPR 2021 and WACV 2021.
Research Highlights:
- [2022/03] 4 papers accepted to CVPR 2022, including (i) METER, an end-to-end transformer-based VLP model; (ii) LEMON, which scales up VLP for image captioning; (iii) SwinBERT, a video Swin Transformer-based video captioning model; and (iv) ViTCAP, a ViT-based image captioning model.
- [2021/10] 3 papers accepted to the main conference track of NeurIPS 2021, including Sparse ViT, Elastic LTH, and GAN lottery tickets; and 2 papers accepted to the Datasets and Benchmarks track of NeurIPS 2021, including AdvGLUE and VALUE.
- [2021/10] Over the summer, we hosted a special Vision-and-Language Talk Series. With 11 invited speakers from both academia and industry, the series covered diverse topics ranging from image captioning, VQA, multimodal pre-training (ALIGN, MDETR), grounded visual generation, zero-shot object detection (ViLD), and video-language understanding (MERLOT) to self-supervised learning (MoCo-v3). Want to know more? Please check the YouTube playlist and MSR video series.
- [2021/09] We all know GPT-3 is a strong few-shot learner for NLP problems, but can it also benefit multimodal tasks? In this new work, we provide an empirical study of GPT-3 for knowledge-based VQA, and show that prompting GPT-3 with image captions and only 16 in-context examples surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset (from 39.4 to 48.0); a minimal sketch of the prompting recipe appears after these highlights.
- [2021/07] Our Adversarial VQA work is accepted to ICCV 2021 as an Oral paper (top 3% of all submissions).
- [2021/06] Our ClipBERT paper wins the Best Student Paper Honorable Mention Award at CVPR 2021.
- [2021/06] Four updates on our recent vision-and-language efforts: (i) our CVPR 2021 tutorial will take place on 6/20; (ii) our VALUE benchmark and competition have been launched; (iii) the arXiv version of our Adversarial VQA benchmark has been released; and (iv) we are the winner of the TextCaps Challenge 2021.
- [2021/05] Two papers accepted to ACL 2021: one long paper in the main conference track and the other in the Findings track. Topics include (i) EarlyBERT for efficient BERT training, and (ii) Cluster-Former, an efficient transformer variant for question answering.
- [2021/03] Two papers accepted to CVPR 2021, and one paper accepted to NAACL 2021. Topics include (i) ClipBERT for video-and-language learning (Oral, with 3 strong accepts), (ii) enhancing contrastive knowledge distillation with Wasserstein learning, and (iii) text generation in hyperbolic space.
- [2021/01] Our Meta Module Network wins the Best Student Paper Honorable Mention Award at WACV 2021.
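For readers curious about the [2021/09] GPT-3 VQA highlight above, here is a minimal sketch of the prompting recipe: the image is first converted to a caption, and GPT-3 then answers the question in-context from a handful of (caption, question, answer) examples. This assumes the legacy OpenAI GPT-3 Completions API, and the prompt template and helper names (`build_prompt`, `gpt3_vqa`) are illustrative rather than the exact format used in the paper.

```python
# Minimal, illustrative sketch (not the exact code from the paper): prompt
# GPT-3 with an image caption as context plus a few (caption, question,
# answer) in-context examples, then read off its short answer.
# Assumes the legacy openai<1.0 package with openai.api_key already set.
import openai

def build_prompt(examples, caption, question):
    """Assemble a few-shot prompt from (caption, question, answer) triples."""
    header = "Please answer the question according to the context.\n\n"
    shots = "".join(f"Context: {c}\nQ: {q}\nA: {a}\n\n" for c, q, a in examples)
    return header + shots + f"Context: {caption}\nQ: {question}\nA:"

def gpt3_vqa(examples, caption, question):
    """Query GPT-3 once and return its (short) answer string."""
    resp = openai.Completion.create(
        engine="davinci",   # GPT-3 base engine
        prompt=build_prompt(examples, caption, question),
        max_tokens=10,      # OK-VQA answers are short
        temperature=0.0,    # greedy decoding for evaluation
        stop="\n",
    )
    return resp["choices"][0]["text"].strip()

# Example usage: the paper's 16-shot setting would pass 16 such triples.
shots = [("A man riding a wave on a surfboard.", "What sport is this?", "surfing")]
print(gpt3_vqa(shots, "A bowl of sliced oranges on a table.",
               "What vitamin is this fruit rich in?"))
```

The key design choice is that GPT-3 never sees pixels: the caption serves as a textual surrogate for the image, which is what lets a text-only language model tackle knowledge-based VQA.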
© April 2022 Zhe Gan