A Tool for Benchmarking Large Language Models' Robustness in Assessing the Realism of Driving Scenarios
Thu 20 Nov 2025, 16:24 - 16:29 at Grand Hall 1, session "Evaluation Frameworks, and Quantitative Assessment of LLMs (Part 2)". Chair(s): Zhou Yang
Thu 20 Nov
Displayed time zone: Seoul
16:00 - 16:50 | Evaluation Frameworks, and Quantitative Assessment of LLMs (Part 2) | Main Track / Benchmark & Dataset Track | Grand Hall 1
Chair(s): Zhou Yang (University of Alberta, Alberta Machine Intelligence Institute)
16:00 | 8m Talk | PromptExp: Multi-granularity Prompt Explanation of Large Language Models | Main Track
Ximing Dong (Centre for Software Excellence at Huawei Canada), Shaowei Wang (University of Manitoba), Dayi Lin (Centre for Software Excellence, Huawei Canada), Gopi Krishnan Rajbahadur (Centre for Software Excellence, Huawei, Canada), Ahmed E. Hassan (Queen’s University)
16:08 | 8m Talk | Beyond Code Explanations: A Ray of Hope for Cross-Language Vulnerability Repair | Main Track
Kevin Lira (North Carolina State University), Baldoino Fonseca (Universidade Federal de Alagoas), Wesley K.G. Assunção (North Carolina State University), Davy Baía (Federal University of Alagoas), Márcio Ribeiro (Federal University of Alagoas, Brazil) | Pre-print
16:16 | 8m Talk | Secure Code Generation at Scale with Reflexion | Benchmark & Dataset Track
Arup Datta (University of North Texas), Ahmed Aljohani (University of North Texas), Hyunsook Do (University of North Texas) | Pre-print
16:24 | 5m Talk | A Tool for Benchmarking Large Language Models' Robustness in Assessing the Realism of Driving Scenarios | Benchmark & Dataset Track
Jiahui Wu (Simula Research Laboratory and University of Oslo), Chengjie Lu (Simula Research Laboratory and University of Oslo), Aitor Arrieta (Mondragon University), Shaukat Ali (Simula Research Laboratory and Oslo Metropolitan University) | Pre-print
16:29 | 21m Live Q&A | Joint Q&A and Discussion #LLMAssessment | Main Track