One interesting aspect of today’s generative Large Language Models (LLMs) is that they are natural polyglots, facile in many languages. These new multi-dexterous capabilities offer new potential for helping people to transcend language barriers when seeking information. In this talk I will describe three projects from our group at the Johns Hopkins University Human Language Technology Center of Excellence (HLTCOE) in which we have been exploring that potential. In the first, we explore the use of LLMs to train ColBERT to optimize ranking quality for Cross-Language Information Retrieval (CLIR). This training process benefits from large quantities of positive query-passage pairs, so the question we ask is whether generative LLMs can generate such pairs from the collection that is ultimately to be searched, with the queries in some desired language that is different from the language in the collection. In the second, we explore whether LLMs can help to characterize the ranking quality of a CLIR system by developing test collections for new language pairs. Here our approach is to use an LLM to approximate the relevance judgments that a human assessor would have made. In the third, we move beyond retrieval to the direct use of generation, exploring the potential of LLMs to write a report on a topic using source materials in multiple languages, looking both at how we might do this, and how we might know if we have done it well. This will be the focus of the new TREC RAGTIME track and a 10-week workshop at the HLTCOE this summer. This is joint work with Noah Hibbler, Efsun Kayi, Reno Kriz, Dawn Lawrie, Sean MacAvaney, Marc Mason, Jim Mayfield, Paul McNamee, Scott Miller, Ian Soboroff, Kate Sanders, Luca Soldaini, Paul Thomas, Will Walden, Orion Weller, Eugene Yang and Andrew Yates.
Speaker Bio
Doug Oard is a Professor at the University of Maryland, with joint appointments in the College of Information (the iSchool) and the University of Maryland Institute for Advanced Computer Studies (UMIACS), and an affiliate appointment at the Johns Hopkins University Human Language Technology Center of Excellence. He is perhaps best known for his research on cross-language information retrieval, but more generally one thread of his research has addressed the use of technologies such as machine translation, speech recognition, document image analysis, knowledge representation, processing mathematical notation, and social network analysis to support information access. He also has interests in applications of information retrieval in specific settings, including archival access and the “discovery” process for exchanging evidence among parties to civil litigation. Among his other current projects are (1) leveraging multiple sources of evidence to help people find content in archives that has not yet been digitized or well described, and (2) detecting inference risks when reviewing previously restricted materials for declassification. More information on Doug’s research and teaching can be found at http://terpconnect.umd.edu/~oard
More Details
- When: Fri. 28 Mar 2025, at 1-2pm (Brisbane time)
- Speaker: Professor Douglas W. Oard, University of Maryland (USA)
- Host: Dr Joel Mackenzie
- Venue: 50-N202 - Hawken Engineering Building, Learning Theatre
- Zoom: https://uqz.zoom.us/my/joelmackenzie