All Talks
Towards Multimodal Intelligence: Bridging Vision, Language, and Large-Scale Models
Multimodal intelligence is revolutionizing document understanding by enabling AI to process and reason across vision and language. This talk explores how large-scale models integrate ...
From Adobe, Mar 07, 2025

From Next Token Prediction to Compliant AI Assistants: A Systematic Path toward Trustworthy Large Language Models
“Language models are systems that can predict upcoming words” - this classical definition of NLP models forms the basis of LLMs becoming responsive text completion models. However, suc...
From UC Merced, Feb 28, 2025
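The abstract above opens with the classical "predict the upcoming word" definition of language models. As a minimal sketch of that idea (using a public GPT-2 checkpoint and greedy decoding purely as illustrative assumptions, not anything specified in the talk), the loop below repeatedly asks the model for its most likely next token and appends it to the prompt:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM would do; GPT-2 is used here only as a small public stand-in.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("Language models are systems that", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(10):                          # greedily generate ten tokens
        logits = model(ids).logits               # (batch, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(-1)    # most likely next token
        ids = torch.cat([ids, next_id[:, None]], dim=-1)

print(tok.decode(ids[0]))                        # prompt plus the predicted continuation
```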
No.25-01 Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Exciting models have been developed in multimodal video understanding and generation, such as video LLMs and video diffusion models. One emerging pathway to the ultimate intelligence is...
From NUS, Feb 27, 2025

No.24-20 Controllable Visual Synthesis via Structural Representation
End-to-end neural approaches have revolutionized visual generation, producing stunning outputs from natural language prompts. However, precise controls remain challenging through dire...
From Stanford, Dec 13, 2024

No.24-19 When LLMs Meet Recommendations: Scalable Hybrid Approaches to Enhance User Experiences
While LLMs offer powerful reasoning and generalization capabilities for user understanding and long-term planning in recommendation systems, their latency and cost hinder direct appli...
From DeepMind, Dec 09, 2024

No.24-18 Developing Effective Long-Context Language Models
In this talk, I will share our journey behind developing an effective long-context language model. I’ll begin by introducing our initial approach of using parallel context encoding (C...
From Princeton, Dec 04, 2024
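The abstract above mentions parallel context encoding as the initial approach. The sketch below is only a generic illustration of that idea, under the assumption that a long context is split into chunks, each chunk is encoded independently (and hence in parallel), and a decoder-side query cross-attends over all chunk representations at once; it is not the speaker's implementation, and all module names and sizes are made up:

```python
import torch
import torch.nn as nn

class ParallelContextEncoding(nn.Module):
    """Toy illustration: encode context chunks independently, then let
    decoder-side queries cross-attend over the concatenated chunk states."""

    def __init__(self, d_model=64, n_heads=4, chunk_len=128):
        super().__init__()
        self.chunk_len = chunk_len
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.chunk_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, context_emb, query_emb):
        # context_emb: (batch, long_len, d_model); query_emb: (batch, q_len, d_model)
        chunks = context_emb.split(self.chunk_len, dim=1)    # split the long context
        encoded = [self.chunk_encoder(c) for c in chunks]    # each chunk encoded independently
        memory = torch.cat(encoded, dim=1)                   # concatenate chunk representations
        out, _ = self.cross_attn(query_emb, memory, memory)  # attend over all chunks jointly
        return out

model = ParallelContextEncoding()
ctx = torch.randn(1, 512, 64)       # embeddings of a "long" 512-token context
qry = torch.randn(1, 16, 64)        # current decoder-side states
print(model(ctx, qry).shape)        # torch.Size([1, 16, 64])
```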
No.24-17 Mitigating Distribution Shifts in Using Pre-trained Vision-Language Models
Benefiting from large-scale image-text pair datasets, powerful pre-trained vision-language models (VLMs, such as CLIP) enable many real-world applications, e.g., zero-shot classificat...
From UniMelb, Dec 02, 2024
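Zero-shot classification with CLIP, mentioned in the abstract above, scores an image against natural-language class prompts and picks the best match. A minimal sketch using the Hugging Face CLIP wrapper follows; the checkpoint, the image path, and the class prompts are illustrative assumptions, not details from the talk:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                 # any local test image (hypothetical path)
labels = ["dog", "cat", "car"]
prompts = [f"a photo of a {c}" for c in labels]   # class names phrased as text prompts

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-text similarities -> class probabilities
print(dict(zip(labels, probs[0].tolist())))
```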
No.24-16 Towards Graph Machine Learning in the Wild
Learning on graphs is a long-standing and fundamental challenge in machine learning, and recent works have demonstrated solid progress in this area. However, most existing models tacit...
From MIT, Nov 27, 2024

No.24-15 Evaluation and Reasoning in Real-world Scenarios
User queries in natural settings, such as “provide a design for a disk topology for a NAS built on TrueNAS Scale, as well as a dataset layout,” differ significantly from those produce...
From Cornell University, Nov 20, 2024

No.24-14 Long-Range Meets Scalability: Unveiling a Linear-Time Graph Neural Network for Recommendation at Scale
Recommender systems play a central role in shaping our daily digital experiences, yet achieving both scalability and expressive power remains a significant challenge. While Graph Neur...
From Pennsylvania State University, Nov 13, 2024
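The teaser above does not describe the talk's specific architecture. As a point of reference for what "linear-time" graph propagation can look like in recommendation, the sketch below uses the nonlinearity-free, LightGCN-style scheme, where each layer is a single sparse matrix product over the normalized user-item graph and therefore costs time linear in the number of interactions; all names, shapes, and the toy graph are illustrative:

```python
import torch

def lightgcn_style_propagate(adj_norm, user_emb, item_emb, num_layers=3):
    """Propagate embeddings over a normalized user-item graph.

    adj_norm: sparse (U+I, U+I) normalized adjacency. Each layer is one
    sparse matmul, i.e. O(num_edges * dim) -- linear in the interactions.
    """
    x = torch.cat([user_emb, item_emb], dim=0)
    layers = [x]
    for _ in range(num_layers):
        x = torch.sparse.mm(adj_norm, x)   # one sparse matmul per layer, no nonlinearity
        layers.append(x)
    out = torch.stack(layers).mean(0)      # average the layer outputs (LightGCN-style readout)
    return out[: user_emb.size(0)], out[user_emb.size(0):]

# Toy usage: 4 users, 5 items, 16-dim embeddings, a tiny symmetric interaction graph.
U, I, D = 4, 5, 16
idx = torch.tensor([[0, 1, 2, 4, 5, 6],
                    [4, 5, 6, 0, 1, 2]])                      # user<->item edges (both directions)
adj = torch.sparse_coo_tensor(idx, torch.full((6,), 0.5), (U + I, U + I))
users, items = lightgcn_style_propagate(adj, torch.randn(U, D), torch.randn(I, D))
scores = users @ items.T                   # rank items for each user by inner-product score
print(scores.shape)                        # torch.Size([4, 5])
```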