Post-training alignment makes LLMs helpful, but also introduces unintended artifacts. This talk explores two such artifacts, their impact on LLM diversity and safety, and presents corresponding solutions. (1) We begin with a data-driven artifact from RLHF, showing how a “typicality bias” in human preferences leads to mode collapse. I will introduce Verbalized Sampling, a principled prompting method that restores diversity across creative writing, social simulation, and synthetic data generation tasks. (2) Next, we shift to a mechanistic artifact from SFT, uncovering how LLMs encode “harmfulness” and “refusal” separately. This insight demystifies how jailbreaks work and enables the Latent Guard, an intrinsic safeguard built on the model’s internal beliefs. Together, these findings call for an artifact-aware approach that looks beyond surface-level behaviors when building and evaluating LLMs.
Speaker Bio
Weiyan is an assistant professor at Northeastern University, working on human-AI interaction, AI agents, and AI safety. She has been recognized as an AI2050 Early Career Fellow and one of MIT Technology Review’s 35 Innovators Under 35, and has received Rising Star awards in both Machine Learning and EECS. Her work has earned a Best Paper Nomination, an Outstanding Paper Award, and a Best Social Impact Paper Award at ACL 2019 and ACL 2024. She co-created the first negotiation AI to achieve human-level performance in Diplomacy, with the work published in Science and featured in The New York Times, Forbes, and other major media.
More Details
- When: Wed 8 April 2026, 11 am – 12 pm (Brisbane time)
- Speaker: Prof Weiyan Shi (Northeastern University)
- Host: Dr Ruihong Qiu
- Coordinator: Dr Zijian Wang
- Zoom: https://uqz.zoom.us/j/85868925939 [Recording]