## Leaderboard
| Model | Main Problem Resolve Rate (%) | Subproblem Resolve Rate (%) |
|---|---|---|
| 🥇 OpenAI o3-mini-low | 10.8 | 33.3 |
| 🥈 OpenAI o3-mini-high | 9.2 | 34.4 |
| 🥉 OpenAI o3-mini-medium | 9.2 | 33.0 |
| OpenAI o1-preview | 7.7 | 28.5 |
| Deepseek-R1 | 4.6 | 28.5 |
| Claude3.5-Sonnet | 4.6 | 26.0 |
| Claude3.5-Sonnet (new) | 4.6 | 25.3 |
| Deepseek-v3 | 3.1 | 23.7 |
| Deepseek-Coder-v2 | 3.1 | 21.2 |
| GPT-4o | 1.5 | 25.0 |
| GPT-4-Turbo | 1.5 | 22.9 |
| OpenAI o1-mini | 1.5 | 22.2 |
| Gemini 1.5 Pro | 1.5 | 21.9 |
| Claude3-Opus | 1.5 | 21.5 |
| Llama-3.1-405B-Chat | 1.5 | 19.8 |
| Claude3-Sonnet | 1.5 | 17.0 |
| Qwen2-72B-Instruct | 1.5 | 17.0 |
| Llama-3.1-70B-Chat | 0.0 | 17.0 |
| Mixtral-8x22B-Instruct | 0.0 | 16.3 |
| Llama-3-70B-Chat | 0.0 | 14.6 |
Note: models that tie on the Main Problem resolve rate are ranked by their Subproblem resolve rate.
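The ranking rule above amounts to sorting on a composite key: Main Problem resolve rate first, Subproblem resolve rate as the tie-breaker, both descending. A minimal sketch in Python, using a small illustrative subset of the table (not an official scoring script):

```python
# Each entry: (model name, main problem resolve rate %, subproblem resolve rate %).
# Values taken from the leaderboard above; this is only a ranking illustration.
entries = [
    ("OpenAI o3-mini-high", 9.2, 34.4),
    ("OpenAI o3-mini-low", 10.8, 33.3),
    ("OpenAI o3-mini-medium", 9.2, 33.0),
]

# Sort by (main rate, subproblem rate), both descending: ties on the main
# rate are broken by the subproblem rate, matching the note above.
ranked = sorted(entries, key=lambda e: (e[1], e[2]), reverse=True)

for name, main, sub in ranked:
    print(f"{name}: main={main}, sub={sub}")
```

Here o3-mini-high and o3-mini-medium tie at 9.2 on the main rate, so the higher subproblem rate (34.4 vs 33.0) decides their order.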
## How to submit
Want to submit your own model? Open a request via a GitHub issue.