
No.26-04 Beyond the Surface: How Post-Training Artifacts Shape LLM Diversity and Safety

Apr 08, 2026 · 1 min read

Post-training alignment makes LLMs helpful, but also introduces unintended artifacts. This talk explores two such artifacts, their impact on LLM diversity and safety, and presents corresponding solutions. (1) We begin with a data-driven artifact from RLHF, showing how a “typicality bias” in human preferences leads to mode collapse. I will introduce Verbalized Sampling, a principled prompting method that restores diversity across creative writing, social simulation, and synthetic data generation tasks. (2) Next, we shift to a mechanistic artifact from SFT, uncovering how LLMs encode “harmfulness” and “refusal” separately. This insight demystifies how jailbreaks work and enables the Latent Guard, an intrinsic safeguard built on the model’s internal beliefs. Together, these findings call for an artifact-aware approach that looks beyond surface-level behaviors when building and evaluating LLMs.
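To make the Verbalized Sampling idea concrete, here is a minimal sketch of what such a prompting scheme could look like. This is an illustrative assumption, not the talk's actual implementation: the prompt wording, the JSON schema, and the helper names (`verbalized_sampling_prompt`, `sample_response`) are all hypothetical, and a mock string stands in for a real LLM call.

```python
import json
import random

def verbalized_sampling_prompt(task: str, k: int = 5) -> str:
    """Build a Verbalized Sampling-style prompt (hypothetical wording):
    rather than asking for one answer, ask the model to verbalize a
    distribution of k candidates with probabilities, which is the kind of
    reframing that can counteract mode collapse."""
    return (
        f"Generate {k} different responses to the task below. "
        "Return JSON: a list of objects with fields 'text' and 'probability', "
        "where 'probability' reflects how likely each response is.\n"
        f"Task: {task}"
    )

def sample_response(model_output: str, rng: random.Random) -> str:
    """Parse the verbalized distribution and draw one candidate
    weighted by its stated probability."""
    candidates = json.loads(model_output)
    weights = [c["probability"] for c in candidates]
    texts = [c["text"] for c in candidates]
    return rng.choices(texts, weights=weights, k=1)[0]

# Mock model output standing in for a real LLM call.
mock = json.dumps([
    {"text": "A lighthouse keeper befriends a storm.", "probability": 0.5},
    {"text": "A clockmaker repairs time itself.", "probability": 0.3},
    {"text": "A city where shadows trade places.", "probability": 0.2},
])

print(sample_response(mock, random.Random(0)))
```

Sampling from the verbalized distribution, rather than always taking the single most likely completion, is what restores diversity in this sketch.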

Speaker Bio

Weiyan is an assistant professor at Northeastern University, working on human-AI interaction, AI agents, and AI safety. She has been recognized as an AI2050 Early Career Fellow and one of MIT Technology Review's 35 Innovators Under 35, and has received Rising Star awards in both Machine Learning and EECS. Her work has earned a Best Paper Nomination, an Outstanding Paper Award, and a Best Social Impact Paper Award at ACL 2019 and ACL 2024. She co-created the first negotiation AI to achieve human-level performance in Diplomacy, with the work published in Science and featured in The New York Times, Forbes, and other major media.
