
Towards Multimodal Intelligence: Bridging Vision, Language, and Large-Scale Models

Mar 07, 2025 · 1 min read

Multimodal intelligence is revolutionizing document understanding by enabling AI to process and reason across vision and language. This talk explores how large-scale models integrate textual and visual information to enhance document analysis, addressing key challenges such as multimodal fusion, layout-aware learning, and post-training optimization. We discuss strategies to improve model adaptability, including contrastive learning and retrieval-augmented fine-tuning. Finally, we highlight future directions in cross-modal reasoning and scalable architectures, aiming to advance automation and knowledge extraction in AI-driven document intelligence.

Speaker Bio

Dr. Jiuxiang Gu is a Senior Research Scientist at Adobe Research, focusing on Machine Learning Theory and Multimodal Learning. He earned his Ph.D. from Nanyang Technological University, Singapore, and has served as an Area Chair for ICLR, ACL, and WACV, alongside various program committee roles. Recognized in Stanford University’s 2023/24 list of the top 2% of scientists globally, Dr. Gu has extensive experience in integrating multimodal large models with interdisciplinary applications, particularly in processing multimodal document data. Visit his personal homepage at gujiuxiang.com.
