Benchmarking Web API Integration Code Generation
API integration is a cornerstone of modern digital infrastructure, enabling software systems to connect and interact. However, as many studies have shown, writing or generating correct code to invoke APIs, particularly web APIs, is challenging. Although large language models (LLMs) have become popular in software development, their effectiveness in automating the generation of web API integration code remains unexplored. To address this, we present WAPIIBench, a dataset and evaluation pipeline designed to assess the ability of LLMs to generate web API invocation code. Our experiments with several open-source LLMs reveal that generating API invocations poses a significant challenge, resulting in hallucinated endpoints, incorrect argument usage, and other errors. None of the evaluated open-source models solved more than 40% of the tasks.
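To illustrate the error classes the abstract mentions, here is a minimal sketch of how a generated web API invocation could be checked against a specification. The spec, endpoint paths, and function names below are purely hypothetical examples, not taken from WAPIIBench or its pipeline.

```python
# Hypothetical sketch: classify a model-generated API invocation against a
# toy specification. Everything here (spec, paths, names) is illustrative.
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    method: str                                   # HTTP verb, e.g. "GET"
    path: str                                     # templated path, e.g. "/users/{id}"
    required_params: set = field(default_factory=set)


# Toy spec: the only endpoints this imaginary API actually offers.
SPEC = {
    ("GET", "/users/{id}"): Endpoint("GET", "/users/{id}", {"id"}),
    ("GET", "/users"): Endpoint("GET", "/users", set()),
}


def check_invocation(method: str, path: str, args: dict) -> str:
    """Return 'ok', 'hallucinated_endpoint', or 'missing_argument'."""
    endpoint = SPEC.get((method.upper(), path))
    if endpoint is None:
        # The model invoked an endpoint the API does not define.
        return "hallucinated_endpoint"
    missing = endpoint.required_params - set(args)
    if missing:
        # The endpoint exists, but a required argument was omitted.
        return "missing_argument"
    return "ok"


print(check_invocation("GET", "/users/{id}", {"id": 42}))  # ok
print(check_invocation("GET", "/user/{id}", {"id": 42}))   # hallucinated_endpoint
print(check_invocation("GET", "/users/{id}", {}))          # missing_argument
```

A real evaluation pipeline would of course work with full request code and a genuine API specification; this sketch only shows why "hallucinated endpoint" and "incorrect argument usage" are separable failure modes.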
Thu 20 Nov | Displayed time zone: Seoul
15:00 - 15:29 | Evaluation Frameworks, and Quantitative Assessment of LLMs (Part 1) | Benchmark & Dataset Track / Main Track at Grand Hall 1 | Chair(s): Zhou Yang (University of Alberta; Alberta Machine Intelligence Institute)
15:00 (8m, Talk) | Automated Extract Method Refactoring with Open-Source LLMs: A Comparative Study | Main Track | Sivajeet Chand (Technical University of Munich), Melih Kilic (Technical University of Munich), Roland Würsching (Technical University of Munich), Sushant Kumar Pandey (University of Groningen, The Netherlands), Alexander Pretschner (TU Munich) | Pre-print
15:08 (8m, Talk) | Benchmarking Web API Integration Code Generation | Benchmark & Dataset Track | Daniel Maninger (TU Darmstadt), Leon Chemnitz (TU Darmstadt), Amir Molzam Sharifloo, Mira Mezini (TU Darmstadt; hessian.AI; National Research Center for Applied Cybersecurity ATHENE) | Pre-print
15:16 (8m, Talk) | From Search to Reasoning: A Five-Level RAG Capability Framework for Enterprise Data | Benchmark & Dataset Track | Gurbinder Gill, Ritvik Gupta (Carnegie Mellon University, USA), Denis Lusson, Anand Chandrashekar, Donald Nguyen (Corvic AI) | Pre-print
15:24 (5m, Talk) | SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks | Benchmark & Dataset Track | Sanket Mhatre (Microsoft), Yasharth Bajpai (Microsoft), Sumit Gulwani (Microsoft), Emerson Murphy-Hill (Microsoft), Gustavo Soares (Microsoft) | Pre-print