AIware 2025
Wed 19 - Thu 20 November 2025
co-located with ASE 2025

Context: Software projects receive a large number of bug reports daily, which practitioners examine manually to determine whether code changes are needed (valid) or not (invalid). Invalid bug reports waste considerable resources, so automated bug report classification is widely studied. However, the use of LLMs and prompting techniques for bug report classification remains underexplored.

Objective: The aim of this study is to apply various LLM and ML methods to classify bug reports as valid or invalid, and to determine the strengths and weaknesses of LLMs for bug report classification.

Method: We trained machine learning models on the dataset and built a retrieval-augmented generation (RAG) pipeline by indexing the training data to see how it affects LLM classification performance. Finally, we compared the majority vote of the three chosen models with a Judge LLM that is provided with both examples and the models' votes.
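The voting and judging steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the model names, prompt wording, and example format are assumptions for demonstration only.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label chosen by the most classifiers ('valid' or 'invalid')."""
    return Counter(votes).most_common(1)[0][0]

def build_judge_prompt(report, examples, votes):
    """Assemble a Judge LLM prompt from retrieved similar reports and model votes.

    The prompt layout here is purely illustrative; the paper does not
    specify the actual prompt template.
    """
    shots = "\n".join(f"- {ex['text']} -> {ex['label']}" for ex in examples)
    ballots = ", ".join(f"{model}: {vote}" for model, vote in votes.items())
    return (
        "Classify the following bug report as valid or invalid.\n"
        f"Similar past reports:\n{shots}\n"
        f"Model votes: {ballots}\n"
        f"Report: {report}\nAnswer:"
    )

# Hypothetical votes from three classifiers
votes = {"model_a": "valid", "model_b": "invalid", "model_c": "valid"}
print(majority_vote(list(votes.values())))  # -> valid
```

In this sketch the Judge LLM receives both the retrieved few-shot examples and the individual model votes, so it can overrule a narrow majority when the retrieved examples point the other way.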

Results: Fine-tuned transformers proved to be the most effective classifiers, with RoBERTa achieving the highest F1 score of 0.909. In contrast, zero-shot LLMs performed poorly, with F1 scores ranging from 0.531 to 0.737. Although RAG-based few-shot prompting significantly boosted LLM performance, lifting the top score to 0.815, it still did not surpass the fine-tuned models. Finally, our Judge LLM pipeline yielded an F1 score of 0.871.

Conclusions: Our evaluation confirms that fine-tuned semantic classifiers remain the most cost-effective and most accurate solution for binary bug report classification. Using RAG with LLMs substantially improves model performance, although the scores do not surpass those of our top fine-tuned models.

Wed 19 Nov

Displayed time zone: Seoul

16:00 - 16:50
Human Factors and Organizational Perspectives in AIware (Main Track at Grand Hall 1)
Chair(s): Hongyu Zhang Chongqing University
16:00
8m
Research paper
Examining the Usage of Generative AI Models in Student Learning Activities for Software Programming
Main Track
Rufeng Chen McGill University, Shuaishuai Jiang, Jiyun Shen, AJung Moon McGill University, Lili Wei McGill University
Pre-print
16:08
8m
Talk
Human to Document, AI to Code: Three Case Studies of Comparing GenAI for Notebook Competitions
Main Track
Tasha Settewong Nara Institute of Science and Technology, Youmei Fan Nara Institute of Science and Technology, Raula Gaikovina Kula The University of Osaka, Kenichi Matsumoto Nara Institute of Science and Technology
Pre-print
16:16
8m
Talk
Judge the Votes: A System to Classify Bug Reports and Give Suggestions
Main Track
Emre Dinc Bilkent University, Eray Tüzün Bilkent University
Pre-print
16:24
8m
Talk
Model-Assisted and Human-Guided: Perceptions and Practices of Software Professionals Using LLMs for Coding
Main Track
Italo Santos University of Hawai‘i at Mānoa, Cleyton Magalhaes Universidade Federal Rural de Pernambuco, Ronnie de Souza Santos University of Calgary
Pre-print
16:32
18m
Live Q&A
Joint Discussion #HumanInTheLoop
Main Track