SciCode Leaderboard
Model | Main Problem Resolve Rate (%) | Subproblem Resolve Rate (%) |
---|---|---|
🥇 OpenAI o3-mini-low | 10.8 | 33.3 |
🥈 OpenAI o3-mini-high | 9.2 | 34.4 |
🥉 OpenAI o3-mini-medium | 9.2 | 33.0 |
OpenAI o1-preview | 7.7 | 28.5 |
Deepseek-R1 | 4.6 | 28.5 |
Claude3.5-Sonnet | 4.6 | 26.0 |
Claude3.5-Sonnet (new) | 4.6 | 25.3 |
Deepseek-v3 | 3.1 | 23.7 |
Deepseek-Coder-v2 | 3.1 | 21.2 |
GPT-4o | 1.5 | 25.0 |
GPT-4-Turbo | 1.5 | 22.9 |
OpenAI o1-mini | 1.5 | 22.2 |
Gemini 1.5 Pro | 1.5 | 21.9 |
Claude3-Opus | 1.5 | 21.5 |
Llama-3.1-405B-Chat | 1.5 | 19.8 |
Claude3-Sonnet | 1.5 | 17.0 |
Qwen2-72B-Instruct | 1.5 | 17.0 |
Llama-3.1-70B-Chat | 0.0 | 17.0 |
Mixtral-8x22B-Instruct | 0.0 | 16.3 |
Llama-3-70B-Chat | 0.0 | 14.6 |
Note: Models that tie on the Main Problem resolve rate are ranked by their Subproblem resolve rate.
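For clarity, here is a minimal sketch of that tie-breaking rule in Python, using a few scores from the table above. This is illustrative only, not the official evaluation or ranking code.

```python
# Illustrative ranking sketch: each entry pairs a model with its Main Problem
# and Subproblem resolve rates (values taken from the leaderboard above).
entries = [
    {"model": "OpenAI o3-mini-low", "main": 10.8, "sub": 33.3},
    {"model": "OpenAI o3-mini-high", "main": 9.2, "sub": 34.4},
    {"model": "OpenAI o3-mini-medium", "main": 9.2, "sub": 33.0},
]

# Sort by Main Problem resolve rate first; ties fall back to the
# Subproblem resolve rate, both in descending order.
ranked = sorted(entries, key=lambda e: (e["main"], e["sub"]), reverse=True)

for rank, e in enumerate(ranked, start=1):
    print(f"{rank}. {e['model']}: main {e['main']}%, sub {e['sub']}%")
```

Running this reproduces the ordering of the top three rows: o3-mini-low leads outright, and o3-mini-high ranks above o3-mini-medium because the tie at 9.2% is broken by its higher Subproblem rate.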
How to submit
Want to submit your own model? Open a request via a GitHub issue.