LLM Arena 🏆

Explore rankings and head-to-head performance comparisons of large language models based on human evaluations

Elo Rankings

Model performance rankings based on head-to-head comparisons and community evaluations

| Rank | Model | Elo Rating |
|------|-------|------------|
| #1 🥇 | gpt-4.1 | 1299 |
| #2 🥈 | claude-3-5-sonnet-20241022 | 1271 |
| #3 🥉 | gemini-1.5-flash-8b | 1256 |
| #4 | Apertus3-70B_iter_564750-tulu3-sft | 1179 |
| #5 | apertus3-70b-4T-sft | 1179 |
| #6 | apertus3-70b-iter_304250 | 1175 |
| #7 | apertus3-70b-iter_90000 | 1163 |
| #8 | apertus3-8b-6.3T-sft | 1142 |
| #9 | apertus3-8b-iter_90000 | 1135 |

Elo rating system: higher ratings indicate stronger performance in head-to-head comparisons. Ratings are computed from wins, losses, and the relative strength of opponents.
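For reference, here is a minimal Python sketch of a standard Elo update. The K-factor of 32, the tie handling, and the worked numbers are illustrative assumptions, not the arena's documented parameters:

```python
# Minimal sketch of a standard Elo update. The K-factor of 32 is a common
# default assumed here; the arena's actual parameters are not stated on
# this page.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, score_a: float,
           k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    """
    ea = expected_score(rating_a, rating_b)
    # The winner gains (and the loser loses) more when the outcome was
    # unexpected; this is how opponent strength enters the rating.
    return rating_a + k * (score_a - ea), rating_b + k * (ea - score_a)


# Example with the top two ratings above: the expected score is ~0.54,
# so a win moves each rating by only ~15 points rather than the full K.
print(update(1299, 1271, 1.0))  # -> (~1313.7, ~1256.3)
```

Note how beating a nearly equal opponent yields only about 15 of the 32 possible points; upsets against much stronger opponents move ratings far more.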

Head-to-Head Matrix

Win-rate matrix between models: each cell shows how often the row model beats the column model (a sketch of how such a matrix can be computed follows the summary statistics below).

| | gpt-4.1 | claude-3-5-sonnet-20241022 | gemini-1.5-flash-8b | Apertus3-70B_iter_564750-tulu3-sft | 70b-4T-sft | 70b-iter_304250 | 70b-iter_90000 | 8b-6.3T-sft | 8b-iter_90000 |
|---|---|---|---|---|---|---|---|---|---|
| gpt-4.1 | – | 68% | 70% | 95% | 94% | 94% | 95% | 96% | 96% |
| claude-3-5-sonnet-20241022 | 32% | – | 54% | 86% | 88% | 89% | 91% | 91% | 93% |
| gemini-1.5-flash-8b | 30% | 46% | – | 80% | 80% | 80% | 84% | 89% | 88% |
| Apertus3-70B_iter_564750-tulu3-sft | 5% | 14% | 20% | – | 50% | 52% | 58% | 66% | 70% |
| 70b-4T-sft | 6% | 12% | 20% | 50% | – | 51% | 58% | 66% | 71% |
| 70b-iter_304250 | 6% | 11% | 20% | 48% | 49% | – | 55% | 67% | 66% |
| 70b-iter_90000 | 5% | 9% | 16% | 42% | 42% | 45% | – | 59% | 65% |
| 8b-6.3T-sft | 4% | 9% | 11% | 34% | 34% | 33% | 41% | – | 54% |
| 8b-iter_90000 | 4% | 7% | 12% | 30% | 29% | 34% | 35% | 46% | – |
Diagonal cells (–) mark a model matched against itself. Values above 50% favor the row model; values below 50% favor the column model.
Models compared: 9 · Total comparisons: 28,980 · Unique matchups: 36 (all C(9,2) = 36 possible pairings of the 9 models)
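For illustration, a small Python sketch of how such a win-rate matrix can be derived from raw pairwise comparisons. The (model_a, model_b, winner) record format is a hypothetical schema, not the arena's actual data model:

```python
# Sketch of building a head-to-head win-rate matrix from pairwise records.
# The (model_a, model_b, winner) tuple format is an assumed schema for
# illustration; ties are simply ignored here, which may differ from the
# arena's handling.
from collections import defaultdict


def win_rate_matrix(records: list[tuple[str, str, str]]) -> dict[tuple[str, str], float]:
    """Map each ordered (row, col) model pair to the fraction of their
    decisive head-to-head comparisons won by the row model."""
    wins = defaultdict(int)    # (winner, loser) -> win count
    totals = defaultdict(int)  # unordered pair -> decisive comparisons
    for a, b, winner in records:
        if winner not in (a, b):
            continue  # skip ties / malformed records
        totals[frozenset((a, b))] += 1
        wins[(winner, a if winner == b else b)] += 1
    matrix = {}
    for pair, n in totals.items():
        x, y = sorted(pair)
        matrix[(x, y)] = wins[(x, y)] / n
        matrix[(y, x)] = wins[(y, x)] / n
    return matrix


# Hypothetical example: two wins for gpt-4.1, one for the sonnet model.
records = [
    ("gpt-4.1", "claude-3-5-sonnet-20241022", "gpt-4.1"),
    ("gpt-4.1", "claude-3-5-sonnet-20241022", "claude-3-5-sonnet-20241022"),
    ("gpt-4.1", "claude-3-5-sonnet-20241022", "gpt-4.1"),
]
print(win_rate_matrix(records))  # row gpt-4.1 beats the column model ~67% of the time
```

By construction each cell and its mirror sum to 100% when ties are excluded, which matches the symmetry visible in the matrix above (e.g., 68% vs 32% for the top pair).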