Benchmarking Web API Integration Code Generation
API integration is a cornerstone of modern digital infrastructure, enabling software systems to connect and interact. However, as many studies have shown, writing or generating correct code to invoke APIs, particularly web APIs, is challenging. Although large language models (LLMs) have become popular in software development, their effectiveness in automating the generation of web API integration code remains unexplored. To address this, we present WAPIIBench, a dataset and evaluation pipeline designed to assess the ability of LLMs to generate web API invocation code. Our experiments with several open-source LLMs reveal that generating API invocations poses a significant challenge, resulting in hallucinated endpoints, incorrect argument usage, and other errors. None of the evaluated open-source models solved more than 40% of the tasks.
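To illustrate the error classes the abstract mentions, here is a minimal sketch of how a generated web API invocation could be checked against a specification. The spec, endpoint paths, and function names below are purely hypothetical examples, not taken from WAPIIBench or its pipeline.

```python
# Hypothetical sketch: classify a model-generated API invocation against a
# toy specification. Everything here (spec, paths, names) is illustrative.
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    method: str                                   # HTTP verb, e.g. "GET"
    path: str                                     # templated path, e.g. "/users/{id}"
    required_params: set = field(default_factory=set)


# Toy spec: the only endpoints this imaginary API actually offers.
SPEC = {
    ("GET", "/users/{id}"): Endpoint("GET", "/users/{id}", {"id"}),
    ("GET", "/users"): Endpoint("GET", "/users", set()),
}


def check_invocation(method: str, path: str, args: dict) -> str:
    """Return 'ok', 'hallucinated_endpoint', or 'missing_argument'."""
    endpoint = SPEC.get((method.upper(), path))
    if endpoint is None:
        # The model invoked an endpoint the API does not define.
        return "hallucinated_endpoint"
    missing = endpoint.required_params - set(args)
    if missing:
        # The endpoint exists, but a required argument was omitted.
        return "missing_argument"
    return "ok"


print(check_invocation("GET", "/users/{id}", {"id": 42}))  # ok
print(check_invocation("GET", "/user/{id}", {"id": 42}))   # hallucinated_endpoint
print(check_invocation("GET", "/users/{id}", {}))          # missing_argument
```

A real evaluation pipeline would of course work with full request code and a genuine API specification; this sketch only shows why "hallucinated endpoint" and "incorrect argument usage" are separable failure modes.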
Thu 20 Nov | Displayed time zone: Seoul
15:00 - 15:29 | Evaluation Frameworks, and Quantitative Assessment of LLMs (Part 1) | Benchmark & Dataset Track / Main Track at Grand Hall 1 | Chair(s): Zhou Yang (University of Alberta; Alberta Machine Intelligence Institute)
15:00 (8m, Talk) | Automated Extract Method Refactoring with Open-Source LLMs: A Comparative Study | Main Track | Sivajeet Chand (Technical University of Munich), Melih Kilic (Technical University of Munich), Roland Würsching (Technical University of Munich), Sushant Kumar Pandey (University of Groningen, The Netherlands), Alexander Pretschner (TU Munich) | Pre-print
15:08 (8m, Talk) | Benchmarking Web API Integration Code Generation | Benchmark & Dataset Track | Daniel Maninger (TU Darmstadt), Leon Chemnitz (TU Darmstadt), Amir Molzam Sharifloo, Mira Mezini (TU Darmstadt; hessian.AI; National Research Center for Applied Cybersecurity ATHENE) | Pre-print
15:16 (8m, Talk) | From Search to Reasoning: A Five-Level RAG Capability Framework for Enterprise Data | Benchmark & Dataset Track | Gurbinder Gill, Ritvik Gupta (Carnegie Mellon University, USA), Denis Lusson, Anand Chandrashekar, Donald Nguyen (Corvic AI) | Pre-print
15:24 (5m, Talk) | SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks | Benchmark & Dataset Track | Sanket Mhatre (Microsoft), Yasharth Bajpai (Microsoft), Sumit Gulwani (Microsoft), Emerson Murphy-Hill (Microsoft), Gustavo Soares (Microsoft) | Pre-print