SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks
Thu 20 Nov 2025 15:24 - 15:29 at Grand Hall 1 - Evaluation Frameworks, and Quantitative Assessment of LLMs (Part 1) Chair(s): Zhou Yang
AI coding agents have shown great progress on Python software engineering benchmarks like SWE-Bench, and on other languages such as Java and C in benchmarks like Multi-SWE-Bench. However, C#, a prominent enterprise language ranking #5 in the TIOBE index, remains absent from such benchmarks. We introduce SWE-Sharp-Bench, a reproducible software engineering benchmark for C# featuring 150 instances from 17 repositories. Evaluating identical model-agent configurations across languages reveals a significant performance gap: while 70% of Python tasks in SWE-Bench Verified are solved, only 40% of our C# tasks are resolved. We open-source SWE-Sharp-Bench and our entire curation pipeline.
Thu 20 Nov
Displayed time zone: Seoul
15:00 - 15:29 | Evaluation Frameworks, and Quantitative Assessment of LLMs (Part 1) | Benchmark & Dataset Track / Main Track | Grand Hall 1 | Chair(s): Zhou Yang (University of Alberta, Alberta Machine Intelligence Institute)
15:00 | 8m Talk | Automated Extract Method Refactoring with Open-Source LLMs: A Comparative Study | Main Track | Sivajeet Chand (Technical University of Munich), Melih Kilic (Technical University of Munich), Roland Würsching (Technical University of Munich), Sushant Kumar Pandey (University of Groningen, The Netherlands), Alexander Pretschner (TU Munich) | Pre-print
15:08 | 8m Talk | Benchmarking Web API Integration Code Generation | Benchmark & Dataset Track | Daniel Maninger (TU Darmstadt), Leon Chemnitz (TU Darmstadt), Amir Molzam Sharifloo, Mira Mezini (TU Darmstadt; hessian.AI; National Research Center for Applied Cybersecurity ATHENE) | Pre-print
15:16 | 8m Talk | From Search to Reasoning: A Five-Level RAG Capability Framework for Enterprise Data | Benchmark & Dataset Track | Gurbinder Gill, Ritvik Gupta (Carnegie Mellon University, USA), Denis Lusson, Anand Chandrashekar, Donald Nguyen (Corvic AI) | Pre-print
15:24 | 5m Talk | SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks | Benchmark & Dataset Track | Sanket Mhatre (Microsoft), Yasharth Bajpai (Microsoft), Sumit Gulwani (Microsoft), Emerson Murphy-Hill (Microsoft), Gustavo Soares (Microsoft) | Pre-print