The problem
LLM benchmarks are run by the same companies selling the models. Marketing claims are not evidence. There was no easy way to blind-test models side by side on real prompts and see which ones actually hold up.
What I built
BlindBench is an open-source evaluation arena. It pulls real prompts from 4 Kaggle datasets, runs them through 100+ models, and scores responses on correctness, reasoning depth, and failure patterns. The interface strips model names during evaluation so judgment is unbiased.
Key design choices:
- Blind testing: Model identities hidden during evaluation to eliminate brand bias.
- Truth scoring: Composite score based on correctness, reasoning quality, and consistency.
- Failure classification: 10 distinct failure types (hallucination, logic errors, refusal bias, etc.) tracked per model.
- BYOK (Bring Your Own Key): API keys encrypted in transit, used once, never stored. Free-tier models work without any key.
Architecture
React + Vite frontend deployed to GitHub Pages. Supabase backend with Edge Functions that proxy LLM calls server-side. Materialized views power the leaderboard and failure analytics. Four seeded Kaggle datasets provide the evaluation corpus.
Distribution
- Live arena — test models immediately
- GitHub repo — full source, seed scripts, deployment guide