Add SWE-bench Multilingual benchmark support#480
Conversation
This commit adds support for the SWE-bench/SWE-bench_Multilingual dataset, following the same conventions as swebench and swebenchmultimodal.

Changes:
- Created benchmarks/swebenchmultilingual/ directory with all necessary files
- Added config.py with dataset 'SWE-bench/SWE-bench_Multilingual'
- Added run_infer.py and eval_infer.py scripts
- Added build_images.py for Docker image management
- Added prompts/default.j2 template
- Added README.md with usage instructions
- Registered swebenchmultilingual-infer and swebenchmultilingual-eval entry points in pyproject.toml

The implementation supports both Docker and remote workspace modes, and handles multilingual repositories with non-English content.

Co-authored-by: openhands <openhands@all-hands.dev>
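The entry-point registration mentioned in the last bullet would look roughly like this in pyproject.toml. This is a sketch: the module paths and `main` function names are assumptions, not taken from the diff.

```toml
# Hypothetical sketch — actual module paths and callables may differ.
[project.scripts]
swebenchmultilingual-infer = "benchmarks.swebenchmultilingual.run_infer:main"
swebenchmultilingual-eval = "benchmarks.swebenchmultilingual.eval_infer:main"
```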
@OpenHands please fix the precommit hooks

I'm on it! juanmichelini can track my progress at all-hands.dev
All pre-commit hooks pass successfully:
- Ruff format: ✓
- Ruff lint: ✓
- PEP8 style check (pycodestyle): ✓
- Type check with Pyright: ✓

Co-authored-by: openhands <openhands@all-hands.dev>
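Hooks like these are typically wired through a `.pre-commit-config.yaml`. A partial sketch covering the Ruff checks; the rev is a placeholder, and the pycodestyle/pyright hooks are omitted because their exact wiring in this repo is unknown:

```yaml
# Partial sketch of a pre-commit config for the Ruff checks above.
# Pin `rev` to whatever version the repository actually uses.
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9
    hooks:
      - id: ruff          # lint
      - id: ruff-format   # formatting
```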
Summary

I've successfully fixed the pre-commit hooks for PR #480. Here's what was done:

The Issue

The repository's submodule (

Actions Taken

Changes Pushed

The PR is now ready with working pre-commit hooks. All code in the
@OpenHands please fix the precommit hooks

I'm on it! juanmichelini can track my progress at all-hands.dev
Summary

I've successfully fixed the pre-commit hooks for PR #480. Here's what was resolved:

The Issue

The pre-commit hooks were failing because the git submodule

Actions Taken

Result

All pre-commit hooks are now working correctly and passing on all files in the repository. No code changes were needed - the submodule was already properly configured in the repository; it just needed to be initialized locally in the workspace. The working tree is clean with nothing to commit or push.

The PR #480 (SWE-bench Multilingual support) can now proceed with proper pre-commit validation in place.
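The recovery described above (initializing an already-configured submodule locally) amounts to two standard git commands. A sketch, demonstrated in a throwaway repository so it is safe to run anywhere; in practice you run the two `git submodule` lines from the actual checkout, then re-run `pre-commit run --all-files` to confirm the hooks pass:

```shell
# Demonstrate submodule initialization in a fresh throwaway repo.
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
git submodule sync --recursive          # refresh submodule URLs from .gitmodules
git submodule update --init --recursive # clone/checkout the pinned commits
echo "submodules initialized"
```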
neubig left a comment
I haven't reviewed it carefully, but could we do a full run to check whether the results approximately match the official benchmark's numbers with the same LM?
- Added test instance creation for swebenchmultilingual benchmark
- Added metadata configuration with correct dataset and prompt path
- Updated tools mock logic to handle get_tools_for_preset (same as swebench)
- All tests passing (13 passed)

Co-authored-by: openhands <openhands@all-hands.dev>
It's at 60%, 7 points under the claude-sonnet result: https://www.swebench.com/multilingual-leaderboard.html
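For concreteness, the arithmetic behind that comment, using the dataset size of 300 instances from the PR description (the 67% leaderboard figure is implied by "7 points under", not stated directly):

```python
# Arithmetic behind the reported result: 60% resolved on 300 instances,
# 7 points below the leaderboard's claude-sonnet entry (implied 67%).
TOTAL_INSTANCES = 300
our_rate = 0.60
leaderboard_rate = 0.67  # implied by "7 points under"

resolved = round(TOTAL_INSTANCES * our_rate)            # instances resolved
gap_points = round((leaderboard_rate - our_rate) * 100)  # gap in points

print(resolved, gap_points)
```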
Summary
This PR adds support for the SWE-bench Multilingual benchmark, following the same conventions as swebench and swebenchmultimodal.

Closes #395
Changes
- Created benchmarks/swebenchmultilingual/ directory with all necessary files
- Dataset: SWE-bench/SWE-bench_Multilingual (300 test instances)
- Inference script (run_infer.py)
- Evaluation script (eval_infer.py)
- Docker image build script (build_images.py)
- Prompt template (prompts/default.j2)
- Entry points: swebenchmultilingual-infer and swebenchmultilingual-eval

Testing
✅ Tested locally on a single instance (apache__druid-13704):

Test Results
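As an aside, SWE-bench instance ids like the one tested above follow an `owner__repo-<number>` convention, which is handy when mapping results back to source repositories. The helper below is illustrative only, not part of the benchmark code:

```python
# Sketch: split a SWE-bench-style instance id ("owner__repo-number")
# into the GitHub repo and the issue/PR number. Helper name is ours.
def parse_instance_id(instance_id: str) -> tuple[str, str]:
    repo_part, _, number = instance_id.rpartition("-")
    owner, _, repo = repo_part.partition("__")
    return f"{owner}/{repo}", number

print(parse_instance_id("apache__druid-13704"))  # ('apache/druid', '13704')
```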
Files Created
Implementation Details
Follows the same conventions as swebench

Next Steps
Note: Marked as draft for initial review. Ready for full testing after approval of the approach.