LLM Arena 🏆
Explore rankings and head-to-head performance comparisons of large language models based on human evaluations
ELO Rankings
Model performance rankings based on head-to-head comparisons and community evaluations
Rank | Model | ELO Rating |
---|---|---|
#1 🥇 | gpt-4.1 | 1299 |
#2 🥈 | claude-3-5-sonnet-20241022 | 1271 |
#3 🥉 | gemini-1.5-flash-8b | 1256 |
#4 | Apertus3-70B_iter_564750-tulu3-sft | 1179 |
#5 | apertus3-70b-4T-sft | 1179 |
#6 | apertus3-70b-iter_304250 | 1175 |
#7 | apertus3-70b-iter_90000 | 1163 |
#8 | apertus3-8b-6.3T-sft | 1142 |
#9 | apertus3-8b-iter_90000 | 1135 |
ELO Rating System: Higher ratings indicate better performance in head-to-head comparisons. Ratings are calculated based on wins, losses, and the relative strength of opponents.
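For illustration, here is a minimal sketch of a standard logistic Elo update, assuming a K-factor of 32; the arena's actual K-factor, initial rating, and tie handling are not specified on this page.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of model A against model B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one head-to-head comparison."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Beating a stronger opponent moves the rating more than beating a weaker one:
# under these assumed parameters, a 1135-rated model upsetting a 1299-rated one
# gains roughly 23 points, while beating an equally rated model gains only 16.
print(update_elo(1135, 1299, a_won=True))
```

This is why the ratings account for the relative strength of opponents: the size of each update depends on how surprising the outcome is.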
Head-to-Head Matrix
Interactive matrix showing win rates between models. Each cell shows how often the row model beats the column model; a sketch of how such rates can be derived from raw comparisons follows the legend below.
Model | gpt-4.1 | claude-3-5-sonnet-20241022 | gemini-1.5-flash-8b | Apertus3-70B_iter_564750-tulu3-sft | apertus3-70b-4T-sft | apertus3-70b-iter_304250 | apertus3-70b-iter_90000 | apertus3-8b-6.3T-sft | apertus3-8b-iter_90000 |
---|---|---|---|---|---|---|---|---|---|
gpt-4.1 | — | 68% | 70% | 95% | 94% | 94% | 95% | 96% | 96% |
claude-3-5-sonnet-20241022 | 32% | — | 54% | 86% | 88% | 89% | 91% | 91% | 93% |
gemini-1.5-flash-8b | 30% | 46% | — | 80% | 80% | 80% | 84% | 89% | 88% |
Apertus3-70B_iter_564750-tulu3-sft | 5% | 14% | 20% | — | 50% | 52% | 58% | 66% | 70% |
apertus3-70b-4T-sft | 6% | 12% | 20% | 50% | — | 51% | 58% | 66% | 71% |
apertus3-70b-iter_304250 | 6% | 11% | 20% | 48% | 49% | — | 55% | 67% | 66% |
apertus3-70b-iter_90000 | 5% | 9% | 16% | 42% | 42% | 45% | — | 59% | 65% |
apertus3-8b-6.3T-sft | 4% | 9% | 11% | 34% | 34% | 33% | 41% | — | 54% |
apertus3-8b-iter_90000 | 4% | 7% | 12% | 30% | 29% | 34% | 35% | 46% | — |
Color legend: higher win rate · same model (diagonal, —) · lower win rate.
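As a reference for how a matrix like this can be computed, here is a minimal sketch that assumes the raw data is available as a list of (winner, loser) pairs; the record format and function names are hypothetical, and ties are not handled.

```python
from collections import Counter
from itertools import product


def win_rate_matrix(comparisons: list[tuple[str, str]]) -> dict[tuple[str, str], float | None]:
    """Compute row-vs-column win rates from (winner, loser) records.

    Returns a dict mapping (row_model, col_model) to the fraction of their
    head-to-head comparisons won by row_model (None if they never met).
    """
    wins = Counter(comparisons)  # (winner, loser) -> number of wins
    models = sorted({m for pair in comparisons for m in pair})
    matrix = {}
    for row, col in product(models, repeat=2):
        if row == col:
            continue  # diagonal: a model is never compared against itself
        total = wins[(row, col)] + wins[(col, row)]
        matrix[(row, col)] = wins[(row, col)] / total if total else None
    return matrix


# Example with made-up records: "a" beats "b" twice and loses once -> ~67%.
example = [("a", "b"), ("a", "b"), ("b", "a")]
print(win_rate_matrix(example)[("a", "b")])  # 0.666...
```

Note that each matchup produces two mirrored cells that sum to 100%, which is why the table above is symmetric around the diagonal.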
Models compared: 9
Total comparisons: 28,980
Unique matchups: 36 (every pair among the 9 models)