SLM RAG Arena - Compare and Find the Best Sub-5B Models for RAG
🏟️ This arena evaluates how well small language models (under 5B parameters) answer questions based on document contexts.
📝 Instructions:
- Click the "Get a Question" button to load a random question with context
- Review the query and context to understand the information provided to the models
- Compare the answers generated by two different models for answer quality or appropriate refusal
- Cast your vote for the better response, select 'Tie' if both are equally good, or 'Neither' if both are inadequate
💬 Query - Question About Document Content
Click "Get a Question" to start
This arena tests how well different AI models answer questions from retrieved contexts, using standardized questions and contexts. All models see exactly the same inputs, ensuring a fair comparison.
We don't allow file uploads here as that would change what we're measuring. Instead, check our leaderboard to find top-performing models for your needs. We'll soon launch a separate playground where you can test models with your own files.
📋 Context - Retrieved Content from the Document
🔍 Compare Models - Are These Grounded, Complete Answers or Correct Refusals?
🏅 Cast Your Vote
✅ Vote Submitted!
Model A was:
Model B was:
SLM RAG Leaderboard
About Elo Ratings
The Elo rating system gives a more informative ranking than raw win rates because it accounts for opponent strength:
- All models start at 1500 points
- Points are exchanged after each comparison based on the expected outcome (see the sketch after this list)
- Beating a stronger model earns more points than beating a weaker one
- The ± value shows the 95% confidence interval on each rating
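
As a concrete illustration, here is a minimal sketch of a standard Elo update in Python. The K-factor of 32, the tie score of 0.5, and the function names are illustrative assumptions, not the arena's actual implementation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return the updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, and 0.5 for a tie.
    k is an assumed K-factor; arenas tune this value.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Both models start at 1500. An upset moves more points than an expected win:
print(elo_update(1400, 1600, score_a=1.0))  # underdog wins -> large gain (~+24)
print(elo_update(1600, 1400, score_a=1.0))  # favorite wins -> small gain (~+8)
```

This is why beating a stronger model earns more points: the expected score against a higher-rated opponent is small, so a win produces a large update.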