Guidelines for Empirical Studies in Software Engineering involving Large Language Models
Large language models (LLMs) are increasingly being integrated into software engineering (SE) research and practice, yet their non-determinism, opaque training data, and evolving architectures complicate the reproduction and replication of empirical studies. We present a community effort to scope this space, introducing a taxonomy of LLM-based study types together with eight guidelines for designing and reporting empirical studies involving LLMs. The guidelines present essential (must) criteria as well as desired (should) criteria and target transparency throughout the research process. Our recommendations, contextualized by our study types, are: (1) to declare LLM usage and role; (2) to report model versions, configurations, and fine-tuning; (3) to document tool architectures; (4) to disclose prompts and interaction logs; (5) to use human validation; (6) to employ an open LLM as a baseline; (7) to use suitable baselines, benchmarks, and metrics; and (8) to openly articulate limitations and mitigations. Our goal is to enable reproducibility and replicability despite LLM-specific barriers to open science. We maintain the study types and guidelines online as a living resource for the community to use and shape.
Thu 20 Nov (displayed time zone: Seoul)
11:50 - 12:30 | Future of AIware | Benchmark & Dataset Track / Main Track / ArXiv Track at Grand Hall 1 | Chair(s): Haoye Tian (Aalto University)
11:50 (8m) Talk | The Future of Generative AI in Software Engineering: A Vision from Industry and Academia in the European GENIUS Project | Main Track | Robin Gröpler (ifak - Institute for Automation and Communication, Magdeburg), Steffen Klepke (Siemens AG), Jack Johns (BT Group PLC), Andreas Dreschinski (Akkodis), Klaus Schmid, Benedikt Dornauer (University of Innsbruck; University of Cologne), Eray Tüzün (Bilkent University), Joost Noppen, Mohammad Reza Mousavi (King's College London), Yongjian Tang (Siemens AG, Germany), Johannes Viehmann (Fraunhofer FOKUS, Germany), Selin Şirin Aslangül, Beum Seuk Lee (BT Group PLC), Adam Ziolkowski (BT), Eric Zie | Pre-print
11:58 (5m) Talk | Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks | Benchmark & Dataset Track
12:03 (5m) Talk | Guidelines for Empirical Studies in Software Engineering involving Large Language Models | ArXiv Track | Sebastian Baltes (Heidelberg University), Florian Angermeir (fortiss GmbH), Chetan Arora (Monash University), Marvin Muñoz Barón (Technical University of Munich), Chunyang Chen (TU Munich), Lukas Böhme (Hasso Plattner Institute, University of Potsdam, Potsdam, Germany), Fabio Calefato (University of Bari), Neil Ernst (University of Victoria), Davide Falessi (University of Rome Tor Vergata, Italy), Brian Fitzgerald (Lero - The Irish Software Research Centre and University of Limerick), Davide Fucci (Blekinge Institute of Technology), Marcos Kalinowski (Pontifical Catholic University of Rio de Janeiro (PUC-Rio)), Stefano Lambiase (Department of Computer Science, Aalborg University, Denmark), Daniel Russo (Department of Computer Science, Aalborg University), Mircea Lungu (IT University, Copenhagen), Lutz Prechelt (Freie Universität Berlin), Paul Ralph (Dalhousie University), Christoph Treude (Singapore Management University), Stefan Wagner (Technical University of Munich) | Pre-print
12:10 (20m) Live Q&A | Joint Q&A and Discussion #FutureofAIware | Main Track