LLM Arena 🏆

Explore rankings and head-to-head performance comparisons of large language models based on human evaluations

Elo Rankings

Model performance rankings based on head-to-head comparisons and community evaluations

| Rank | Model | Elo Rating |
|------|-------|------------|
| #1 🥇 | gpt-4.1 | 1299 |
| #2 🥈 | claude-3-5-sonnet-20241022 | 1271 |
| #3 🥉 | gemini-1.5-flash-8b | 1256 |
| #4 | Apertus3-70B_iter_564750-tulu3-sft | 1179 |
| #5 | apertus3-70b-4T-sft | 1179 |
| #6 | apertus3-70b-iter_304250 | 1175 |
| #7 | apertus3-70b-iter_90000 | 1163 |
| #8 | apertus3-8b-6.3T-sft | 1142 |
| #9 | apertus3-8b-iter_90000 | 1135 |

Elo rating system: higher ratings indicate stronger performance in head-to-head comparisons. Ratings are computed from wins, losses, and the relative strength of opponents.
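For reference, here is a minimal Python sketch of a standard Elo update. The K-factor of 32, the tie handling, and the worked numbers are illustrative assumptions, not the arena's documented parameters:

```python
# Minimal sketch of a standard Elo update. The K-factor of 32 is a common
# default assumed here; the arena's actual parameters are not stated on
# this page.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, score_a: float,
           k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    """
    ea = expected_score(rating_a, rating_b)
    # The winner gains (and the loser loses) more when the outcome was
    # unexpected; this is how opponent strength enters the rating.
    return rating_a + k * (score_a - ea), rating_b + k * (ea - score_a)


# Example with the top two ratings above: the expected score is ~0.54,
# so a win moves each rating by only ~15 points rather than the full K.
print(update(1299, 1271, 1.0))  # -> (~1313.7, ~1256.3)
```

Note how beating a nearly equal opponent yields only about 15 of the 32 possible points; upsets against much stronger opponents move ratings far more.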

Head-to-Head Matrix

Win-rate matrix between models: each cell shows how often the row model beats the column model (a sketch of how such a matrix can be computed follows the summary statistics below).

| | gpt-4.1 | claude-3-5-sonnet-20241022 | gemini-1.5-flash-8b | Apertus3-70B_iter_564750-tulu3-sft | 70b-4T-sft | 70b-iter_304250 | 70b-iter_90000 | 8b-6.3T-sft | 8b-iter_90000 |
|---|---|---|---|---|---|---|---|---|---|
| gpt-4.1 | – | 68% | 70% | 95% | 94% | 94% | 95% | 96% | 96% |
| claude-3-5-sonnet-20241022 | 32% | – | 54% | 86% | 88% | 89% | 91% | 91% | 93% |
| gemini-1.5-flash-8b | 30% | 46% | – | 80% | 80% | 80% | 84% | 89% | 88% |
| Apertus3-70B_iter_564750-tulu3-sft | 5% | 14% | 20% | – | 50% | 52% | 58% | 66% | 70% |
| 70b-4T-sft | 6% | 12% | 20% | 50% | – | 51% | 58% | 66% | 71% |
| 70b-iter_304250 | 6% | 11% | 20% | 48% | 49% | – | 55% | 67% | 66% |
| 70b-iter_90000 | 5% | 9% | 16% | 42% | 42% | 45% | – | 59% | 65% |
| 8b-6.3T-sft | 4% | 9% | 11% | 34% | 34% | 33% | 41% | – | 54% |
| 8b-iter_90000 | 4% | 7% | 12% | 30% | 29% | 34% | 35% | 46% | – |
Diagonal cells (–) mark a model matched against itself. Values above 50% favor the row model; values below 50% favor the column model.
Models compared: 9 · Total comparisons: 28,980 · Unique matchups: 36 (all C(9,2) = 36 possible pairings of the 9 models)
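For illustration, a small Python sketch of how such a win-rate matrix can be derived from raw pairwise comparisons. The (model_a, model_b, winner) record format is a hypothetical schema, not the arena's actual data model:

```python
# Sketch of building a head-to-head win-rate matrix from pairwise records.
# The (model_a, model_b, winner) tuple format is an assumed schema for
# illustration; ties are simply ignored here, which may differ from the
# arena's handling.
from collections import defaultdict


def win_rate_matrix(records: list[tuple[str, str, str]]) -> dict[tuple[str, str], float]:
    """Map each ordered (row, col) model pair to the fraction of their
    decisive head-to-head comparisons won by the row model."""
    wins = defaultdict(int)    # (winner, loser) -> win count
    totals = defaultdict(int)  # unordered pair -> decisive comparisons
    for a, b, winner in records:
        if winner not in (a, b):
            continue  # skip ties / malformed records
        totals[frozenset((a, b))] += 1
        wins[(winner, a if winner == b else b)] += 1
    matrix = {}
    for pair, n in totals.items():
        x, y = sorted(pair)
        matrix[(x, y)] = wins[(x, y)] / n
        matrix[(y, x)] = wins[(y, x)] / n
    return matrix


# Hypothetical example: two wins for gpt-4.1, one for the sonnet model.
records = [
    ("gpt-4.1", "claude-3-5-sonnet-20241022", "gpt-4.1"),
    ("gpt-4.1", "claude-3-5-sonnet-20241022", "claude-3-5-sonnet-20241022"),
    ("gpt-4.1", "claude-3-5-sonnet-20241022", "gpt-4.1"),
]
print(win_rate_matrix(records))  # row gpt-4.1 beats the column model ~67% of the time
```

By construction each cell and its mirror sum to 100% when ties are excluded, which matches the symmetry visible in the matrix above (e.g., 68% vs 32% for the top pair).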