This program is tentative and subject to change.
Wed 19 Nov (displayed time zone: Seoul)
09:00 - 09:20
17:35 - 18:20
Thu 20 Nov (displayed time zone: Seoul)
10:30 - 11:50
10:30 | 8m Talk | Understanding the Characteristics of LLM-Generated Property-Based Tests in Exploring Edge Cases | Main Track | Hidetake Tanaka (Nara Institute of Science and Technology), Haruto Tanaka (Nara Institute of Science and Technology), Kazumasa Shimari (Nara Institute of Science and Technology), Kenichi Matsumoto (Nara Institute of Science and Technology) | Pre-print
10:38 | 8m Talk | Understanding LLM-Driven Test Oracle Generation | Main Track | Adam Bodicoat (University of Auckland), Gunel Jahangirova (King's College London), Valerio Terragni (University of Auckland)
10:46 | 8m Talk | Turning Manual Tasks into Actions: Assessing the Effectiveness of Gemini-generated Selenium Tests | Main Track | Myron David Peixoto (Federal University of Alagoas), Baldoino Fonseca (Universidade Federal de Alagoas), Davy Baía (Federal University of Alagoas), Kevin Lira (North Carolina State University), Márcio Ribeiro (Federal University of Alagoas, Brazil), Wesley K.G. Assunção (North Carolina State University), Nathalia Nascimento (Pennsylvania State University), Paulo Alencar (University of Waterloo) | File Attached
10:54 | 8m Talk | Software Testing with Large Language Models: An Interview Study with Practitioners | Main Track | Maria Deolinda (CESAR School), Cleyton Magalhaes (Universidade Federal Rural de Pernambuco), Ronnie de Souza Santos (University of Calgary)
11:02 | 8m Talk | HPCAgentTester: A Multi-Agent LLM Approach for Enhanced HPC Unit Test Generation | Main Track | Rabimba Karanjai (University of Houston), Lei Xu (Kent State University), Weidong Shi (University of Houston)
11:10 | 8m Research paper | Assertion-Aware Test Code Summarization with Large Language Models | Main Track | Anamul Haque Mollah (University of North Texas), Ahmed Aljohani (Rochester Institute of Technology), Hyunsook Do (University of North Texas) | File Attached
11:20 | 30m Live Q&A | Joint Q&A and Discussion #LLMforTesting | Main Track
11:50 - 12:30
11:50 | 8m Talk | The Future of Generative AI in Software Engineering: A Vision from Industry and Academia in the European GENIUS Project | Main Track | Robin Gröpler (ifak - Institute for Automation and Communication, Magdeburg), Steffen Klepke (Siemens AG), Jack Johns (BT Group PLC), Andreas Dreschinski (Akkodis), Klaus Schmid, Benedikt Dornauer (University of Innsbruck; University of Cologne), Eray Tüzün (Bilkent University), Joost Noppen, Mohammad Reza Mousavi (King's College London), Yongjian Tang (Siemens AG, Germany), Johannes Viehmann (Fraunhofer FOKUS, Germany), Selin Şirin Aslangül, Beum Seuk Lee (BT Group PLC), Adam Ziolkowski (BT), Eric Zie | Pre-print
11:58 | 5m Talk | Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks (short paper-benchmark) | Main Track
12:03 | 5m Talk | Guidelines for Empirical Studies in Software Engineering involving Large Language Models | ArXiv Track | Sebastian Baltes (Heidelberg University), Florian Angermeir (fortiss GmbH), Chetan Arora (Monash University), Marvin Muñoz Barón (Technical University of Munich), Chunyang Chen (TU Munich), Lukas Böhme (Hasso Plattner Institute, University of Potsdam, Potsdam, Germany), Fabio Calefato (University of Bari), Neil Ernst (University of Victoria), Davide Falessi (University of Rome Tor Vergata, Italy), Brian Fitzgerald (Lero - The Irish Software Research Centre and University of Limerick), Davide Fucci (Blekinge Institute of Technology), Marcos Kalinowski (Pontifical Catholic University of Rio de Janeiro (PUC-Rio)), Stefano Lambiase (Department of Computer Science, Aalborg University, Denmark), Daniel Russo (Department of Computer Science, Aalborg University), Mircea Lungu (IT University, Copenhagen), Lutz Prechelt (Freie Universität Berlin), Paul Ralph (Dalhousie University), Christoph Treude (Singapore Management University), Stefan Wagner (Technical University of Munich) | Pre-print
12:08 | 22m Live Q&A | Joint Q&A and Discussion #FutureofAIware | Main Track
15:00 - 15:29
15:00 | 29m Talk | Automated Extract Method Refactoring with Open-Source LLMs: A Comparative Study | Main Track | Sivajeet Chand (Technical University of Munich), Melih Kilic (Technical University of Munich), Roland Würsching (Technical University of Munich), Sushant Kumar Pandey (University of Groningen, The Netherlands), Alexander Pretschner (TU Munich) | Pre-print
16:00 - 16:50
16:00 | 16m Talk | Beyond Code Explanations: A Ray of Hope for Cross-Language Vulnerability Repair | Main Track | Kevin Lira (North Carolina State University), Baldoino Fonseca (Universidade Federal de Alagoas), Wesley K.G. Assunção (North Carolina State University), Davy Baía (Federal University of Alagoas), Márcio Ribeiro (Federal University of Alagoas, Brazil)
16:16 | 16m Talk | PromptExp: Multi-granularity Prompt Explanation of Large Language Models | Main Track | Ximing Dong (Centre for Software Excellence at Huawei Canada), Shaowei Wang (University of Manitoba), Dayi Lin (Centre for Software Excellence, Huawei Canada), Gopi Krishnan Rajbahadur (Centre for Software Excellence, Huawei, Canada), Ahmed E. Hassan (Queen’s University)
16:33 | 16m Live Q&A | Joint Q&A and Discussion #LLMAssessment | Main Track
16:50 - 17:35
16:50 | 8m Talk | Neuro-Symbolic Compliance: Integrating LLMs and SMT for Automated Financial Legal Analysis | Main Track | Yung Shen Hsia (National Chengchi University), Fang Yu (National Chengchi University), Jie-Hong Roland Jiang (National Taiwan University) | File Attached
16:58 | 8m Talk | Are We Aligned? A Preliminary Investigation of the Alignment of Responsible AI Values between LLMs and Human Judgment | Main Track | Asma Yamani (King Fahd University of Petroleum and Minerals), Malak Baslyman (King Fahd University of Petroleum & Minerals), Moataz Ahmed (King Fahd University of Petroleum and Minerals) | File Attached
17:06 | 5m Talk | A Vision for Value-Aligned AI-Driven Systems | Main Track | Humphrey Obie (Monash University)
17:11 | 5m Talk | Generative AI and Empirical Software Engineering: A Paradigm Shift | Main Track | Pre-print
17:16 | 19m Live Q&A | Joint Discussion #ResponsibleAI | Main Track
17:35 - 18:20
18:20 - 18:40
Call for Papers
The AIWare Datasets and Benchmarks track invites high quality publications on highly valuable datasets and benchmarks crucial for the development and continuous improvement of AIware. Such datasets and benchmarks are essential for development and evaluation of AIware and their evolution. This track encourages high quality datasets and benchmarks for development and assessment of AIware in the following areas:
- Data papers that include:
- New datasets, or carefully designed (collections of) datasets based on previously available data, tailored for AIware.
- Data generators and reinforcement learning environments.
- Data-centric AI methods and tools, e.g., to measure and improve data quality or utility, or studies in data-centric AI that bring important new insights.
- Advanced practices in data collection and curation are of general interest even if the data itself cannot be shared.
- Frameworks for responsible dataset development, audits of existing datasets, and identifying significant problems with existing datasets and their use.
- Tools and best practices to enhance dataset creation, documentation, metadata standards, ethical data handling (e.g., licensing, privacy), and accessibility.
- Benchmarking papers are expected to include:
- Benchmarks on new or existing metrics, as well as benchmarking tools.
- Systematic analyses of existing systems on novel datasets that yield important new insights.
- Meaningful benchmarks that drive progress in the performance, robustness, fairness, reliability, and usability of AIware tools.
Topics of interest
Topics of interest fall under those of the AIware conference, with an emphasis on the scope for dataset and benchmark papers described above.
Submissions
The AIware 2025 Datasets and Benchmarks track welcomes submissions from both academia and industry. At least one author of each accepted submission will be required to attend the conference and present the paper.
NEW:
- Short papers: up to 4 pages, including references.
- Long papers: 6-8 pages, including references.
At the time of submission, papers should disclose their (anonymized and curated) data/benchmarks to support reproducibility and replicability.
All submissions must be in English and in PDF format. The page limit is strict, and it will not be possible to purchase additional pages at any point in the process (including after acceptance).
Submission guidelines follow those of the AIware conference main track. Papers must be submitted electronically via the OpenReview platform at the following submission site: https://openreview.net/group?id=ACM.org/AIWare/2025/Data_and_Benchmark_Track
Authors are required to have active OpenReview accounts for submission. (An institutional email address is recommended for registration; otherwise, it might take a couple of days for OpenReview to manually activate the account.) More information about OpenReview is provided on the AIware conference main track page.
Review and evaluation process
Authors are encouraged to prepare their submissions for double-anonymous review. However, single-anonymous review is also allowed, in which the authors’ identities are revealed but the reviewers’ identities are not.
Evaluation criteria:
For Data papers:
- Novelty: originality of the dataset or tool and clarity of its relation to related work
- Impact: value, usefulness, and reusability of the dataset or tool
- Relevance: the relevance of the proposed dataset or tool for the AIware audience
- Presentation: quality of the presentation
- Open Usage: accessibility of the datasets or tool, i.e., the data/tool can be found and obtained without a personal request, and any required code should be open source
For Benchmarking papers:
- Novelty: the originality of the underlying ideas and clarity of their relation to related work
- Impact: the potential reach of the proposed tool, metric, or dataset and the usefulness of the results
- Relevance: the relevance of the proposed benchmark for the AIware audience
- Presentation: the quality of the presentation
- Open Usage: accessibility of the datasets, metrics, or tools, i.e., the data/tool/metric can be found and obtained without a personal request, and any required code should be open source
Awards
AIware Distinguished Dataset (or Benchmark) Award: given to the best full-length paper accepted in the Datasets and Benchmarks track.