Exciting models have been developed in multimodal video understanding and generation, such as video LLMs and video diffusion models. One emerging pathway to the ultimate intelligence is to create a single foundation model that can do both understanding and generation. After all, humans use only one brain for both tasks. Towards such unification, recent attempts employ a base language model for multimodal understanding but require an additional pre-trained diffusion model for visual generation, so understanding and generation still remain two separate components. In this work, we present Show-o, a single transformer that handles both multimodal understanding and generation. Unlike fully autoregressive models, Show-o is the first to unify autoregressive and discrete diffusion modeling, flexibly supporting a wide range of vision-language tasks including visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation with any input/output format, all within a single 1.3B-parameter transformer. Across various benchmarks, Show-o demonstrates comparable or superior performance, shedding light on how to build the next-generation video foundation model.
Speaker Bio
Mike Shou is a tenure-track Assistant Professor (Presidential Young Professorship) at the National University of Singapore. He was previously a Research Scientist at Facebook AI in the Bay Area. He obtained his Ph.D. from Columbia University in the City of New York, working with Prof. Shih-Fu Chang. He was a Best Paper Finalist at CVPR’22 and received a Best Student Paper Nomination at CVPR’17, the PREMIA Best Paper Award 2023, and the EgoVis Distinguished Paper Award 2022/23. His team won 1st place in international challenges including EPIC-Kitchens 2022 and Ego4D 2022 & 2023. He is a Singapore Technologies Engineering Distinguished Professor and a Fellow of the National Research Foundation Singapore. He is on the Forbes 30 Under 30 Asia list.
More Details
- When: Thu 27 Feb 2025, 1-2 pm (Brisbane time)
- Speaker: Prof Mike Shou (NUS)
- Host: Dr Ruihong Qiu
- Zoom: https://uqz.zoom.us/j/86003984130 [Recording will only be available internally by request.]