Guidelines for Empirical Studies in Software Engineering involving Large Language Models (AIware 2025 - ArXiv Track)

Who

Sebastian Baltes, Florian Angermeir, Chetan Arora, Marvin Muñoz Barón, Chunyang Chen, Lukas Böhme, Fabio Calefato, Neil Ernst, Davide Falessi, Brian Fitzgerald, Davide Fucci, Marcos Kalinowski, Stefano Lambiase, Daniel Russo, Mircea Lungu, Lutz Prechelt, Paul Ralph, Christoph Treude, Stefan Wagner

Track

AIware 2025 ArXiv Track

Time Zone

The program is currently displayed in (GMT+09:00) Seoul.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Thu 20 Nov 2025 12:03 - 12:08 at Grand Hall 1 - Future of AIware Chair(s): Haoye Tian

Abstract

Large language models (LLMs) are increasingly being integrated into software engineering (SE) research and practice, yet their non-determinism, opaque training data, and evolving architectures complicate the reproduction and replication of empirical studies. We present a community effort to scope this space, introducing a taxonomy of LLM-based study types together with eight guidelines for designing and reporting empirical studies involving LLMs. The guidelines present essential (must) criteria as well as desired (should) criteria and target transparency throughout the research process. Our recommendations, contextualized by our study types, are: (1) to declare LLM usage and role; (2) to report model versions, configurations, and fine-tuning; (3) to document tool architectures; (4) to disclose prompts and interaction logs; (5) to use human validation; (6) to employ an open LLM as a baseline; (7) to use suitable baselines, benchmarks, and metrics; and (8) to openly articulate limitations and mitigations. Our goal is to enable reproducibility and replicability despite LLM-specific barriers to open science. We maintain the study types and guidelines online as a living resource for the community to use and shape.

Link to Preprint

https://arxiv.org/abs/2508.15503

Sebastian Baltes

Heidelberg University

Germany

Florian Angermeir

fortiss GmbH

Chetan Arora

Monash University

Australia

Marvin Muñoz Barón

Technical University of Munich

Germany

Chunyang Chen

TU Munich

Germany

Lukas Böhme

Hasso Plattner Institute, University of Potsdam, Potsdam, Germany

Germany

Fabio Calefato

University of Bari

Italy

Neil Ernst

University of Victoria

Canada

Davide Falessi

University of Rome Tor Vergata, Italy

Italy

Brian Fitzgerald

Lero - The Irish Software Research Centre and University of Limerick

Ireland

Davide Fucci

Blekinge Institute of Technology

Sweden

Marcos Kalinowski

Pontifical Catholic University of Rio de Janeiro (PUC-Rio)

Brazil

Stefano Lambiase

Department of Computer Science, Aalborg University, Denmark

Denmark

Daniel Russo

Department of Computer Science, Aalborg University

Denmark

Mircea Lungu

IT University, Copenhagen

Lutz Prechelt

Freie Universität Berlin

Germany

Paul Ralph

Dalhousie University

Canada

Christoph Treude

Singapore Management University

Singapore

Stefan Wagner

Technical University of Munich

Germany

Session Program

Thu 20 Nov
Displayed time zone: Seoul

11:50 - 12:30	Future of AIware (Benchmark & Dataset Track / Main Track / ArXiv Track) at Grand Hall 1. Chair(s): Haoye Tian, Aalto University

11:50 8m Talk		The Future of Generative AI in Software Engineering: A Vision from Industry and Academia in the European GENIUS Project Main Track Robin Gröpler ifak - Institute for Automation and Communication, Magdeburg, Steffen Klepke Siemens AG, Jack Johns BT Group PLC, Andreas Dreschinski Akkodis, Klaus Schmid, Benedikt Dornauer University of Innsbruck; University of Cologne, Eray Tüzün Bilkent University, Joost Noppen, Mohammad Reza Mousavi King's College London, Yongjian Tang Siemens AG, Germany, Johannes Viehmann Fraunhofer FOKUS, Germany, Selin Şirin Aslangül, Beum Seuk Lee BT Group PLC, Adam Ziolkowski BT, Eric Zie Pre-print
11:58 5m Talk		Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks Benchmark & Dataset Track
12:03 5m Talk		Guidelines for Empirical Studies in Software Engineering involving Large Language Models ArXiv Track Sebastian Baltes Heidelberg University, Florian Angermeir fortiss GmbH, Chetan Arora Monash University, Marvin Muñoz Barón Technical University of Munich, Chunyang Chen TU Munich, Lukas Böhme Hasso Plattner Institute, University of Potsdam, Potsdam, Germany, Fabio Calefato University of Bari, Neil Ernst University of Victoria, Davide Falessi University of Rome Tor Vergata, Italy, Brian Fitzgerald Lero - The Irish Software Research Centre and University of Limerick, Davide Fucci Blekinge Institute of Technology, Marcos Kalinowski Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Stefano Lambiase Department of Computer Science, Aalborg University, Denmark, Daniel Russo Department of Computer Science, Aalborg University, Mircea Lungu IT University, Copenhagen, Lutz Prechelt Freie Universität Berlin, Paul Ralph Dalhousie University, Christoph Treude Singapore Management University, Stefan Wagner Technical University of Munich Pre-print
12:10 20m Live Q&A		Joint Q&A and Discussion #FutureofAIware Main Track