Leaderboard

| Model | Main Problem Resolve Rate (%) | Subproblem Resolve Rate (%) |
|---|---|---|
| 🥇 OpenAI o3-mini-low | 10.8 | 33.3 |
| 🥈 OpenAI o3-mini-high | 9.2 | 34.4 |
| 🥉 OpenAI o3-mini-medium | 9.2 | 33.0 |
| OpenAI o1-preview | 7.7 | 28.5 |
| Deepseek-R1 | 4.6 | 28.5 |
| Claude3.5-Sonnet | 4.6 | 26.0 |
| Claude3.5-Sonnet (new) | 4.6 | 25.3 |
| Deepseek-v3 | 3.1 | 23.7 |
| Deepseek-Coder-v2 | 3.1 | 21.2 |
| GPT-4o | 1.5 | 25.0 |
| GPT-4-Turbo | 1.5 | 22.9 |
| OpenAI o1-mini | 1.5 | 22.2 |
| Gemini 1.5 Pro | 1.5 | 21.9 |
| Claude3-Opus | 1.5 | 21.5 |
| Llama-3.1-405B-Chat | 1.5 | 19.8 |
| Claude3-Sonnet | 1.5 | 17.0 |
| Qwen2-72B-Instruct | 1.5 | 17.0 |
| Llama-3.1-70B-Chat | 0.0 | 17.0 |
| Mixtral-8x22B-Instruct | 0.0 | 16.3 |
| Llama-3-70B-Chat | 0.0 | 14.6 |

Note: Models that tie on the Main Problem resolve rate are ranked by their Subproblem resolve rate.
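
The tie-breaking rule in the note amounts to a two-key descending sort. The snippet below is a minimal sketch of that ordering, not the leaderboard's actual ranking code: the `rows` tuples and the `rank` helper are hypothetical names introduced for illustration, and the four sample entries are copied from the table above.

```python
# Hypothetical layout: (model, main_problem_rate, subproblem_rate).
# The values are taken from the leaderboard table; the structure is an assumption.
rows = [
    ("OpenAI o3-mini-medium", 9.2, 33.0),
    ("OpenAI o3-mini-low", 10.8, 33.3),
    ("OpenAI o3-mini-high", 9.2, 34.4),
    ("OpenAI o1-preview", 7.7, 28.5),
]


def rank(entries):
    """Order by main-problem rate, breaking ties on subproblem rate (both descending)."""
    return sorted(entries, key=lambda e: (e[1], e[2]), reverse=True)


for position, (model, main_rate, sub_rate) in enumerate(rank(rows), start=1):
    print(f"{position}. {model}: main={main_rate}, sub={sub_rate}")
```

Here o3-mini-high and o3-mini-medium tie at 9.2 on the main problems, so the subproblem rate (34.4 versus 33.0) decides their order.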

How to submit

Want to submit your own model? Open a request via a GitHub issue.