
AI Evaluation

BlindBench

Open-source arena that blind-tests 100+ AI models on real prompts from 4 Kaggle datasets. Generates truth scores and win rates, and classifies 10 failure types. BYOK support means API keys stay client-side and are never stored.

2026 · React · Supabase · Vite · TailwindCSS · Kaggle · Open Source
106+ models tested · 4 Kaggle datasets · 10 failure types · Zero key storage

The problem

LLM benchmarks are run by the same companies selling the models. Marketing claims are not evidence. There was no easy way to blind-test models side by side on real prompts and see which ones actually hold up.

What I built

BlindBench is an open-source evaluation arena. It pulls real prompts from 4 Kaggle datasets, runs them through 100+ models, and scores responses on correctness, reasoning depth, and failure patterns. The interface strips model names during evaluation so judgments aren't swayed by brand recognition.
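The name-stripping step can be sketched as a simple anonymization pass; the label format and shuffle approach here are illustrative assumptions, not BlindBench's actual implementation:

```typescript
// Fisher-Yates shuffle so label assignment carries no ordering bias.
function shuffle<T>(items: T[]): T[] {
  const a = [...items];
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

// Map real model names to anonymous labels ("Model A", "Model B", ...)
// for the duration of an evaluation round; the mapping is revealed only
// after votes are recorded. (Assumed flow, shown for illustration.)
function anonymize(models: string[]): Map<string, string> {
  const mapping = new Map<string, string>();
  shuffle(models).forEach((name, i) =>
    mapping.set(name, `Model ${String.fromCharCode(65 + i)}`)
  );
  return mapping;
}
```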

Key design choices:

  • Blind testing: Model identities hidden during evaluation to eliminate brand bias.
  • Truth scoring: Composite score based on correctness, reasoning quality, and consistency.
  • Failure classification: 10 distinct failure types (hallucination, logic errors, refusal bias, etc.) tracked per model.
  • BYOK (Bring Your Own Key): API keys encrypted in transit, used once, never stored. Free-tier models work without any key.
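The truth score described above is a composite of correctness, reasoning quality, and consistency. A minimal sketch of what such a blend could look like follows; the field names, weights, and 0-100 scale are assumptions for illustration, not the real formula:

```typescript
// All inputs are assumed normalized to the 0-1 range.
interface Evaluation {
  correctness: number; // factual accuracy against the dataset's reference answer
  reasoning: number;   // depth and quality of the reasoning chain
  consistency: number; // agreement across repeated runs of the same prompt
}

// Weighted blend; the weights here are illustrative assumptions.
function truthScore(e: Evaluation): number {
  const score = 0.5 * e.correctness + 0.3 * e.reasoning + 0.2 * e.consistency;
  return Math.round(score * 100); // reported on a 0-100 scale
}
```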

Architecture

React + Vite frontend deployed to GitHub Pages. Supabase backend with Edge Functions that proxy LLM calls server-side. Materialized views power the leaderboard and failure analytics. Four seeded Kaggle datasets provide the evaluation corpus.
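The Edge Function proxy might look roughly like the sketch below. The upstream provider URL, request shape, and handler wiring are assumptions for illustration; the request-building logic is pulled into a pure function so it can be checked without a network:

```typescript
interface ProxyRequest {
  prompt: string;
  model: string;
  apiKey: string; // BYOK key: forwarded for this single call, never persisted
}

// Build the upstream fetch arguments. The endpoint is a hypothetical
// provider URL, not necessarily the one BlindBench uses.
function buildUpstreamCall(req: ProxyRequest): { url: string; init: RequestInit } {
  return {
    url: "https://api.example-llm-provider.com/v1/chat/completions",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${req.apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: req.model,
        messages: [{ role: "user", content: req.prompt }],
      }),
    },
  };
}

// Inside the Edge Function handler (Deno runtime), the sketch would be used as:
// Deno.serve(async (req) => {
//   const call = buildUpstreamCall(await req.json());
//   const upstream = await fetch(call.url, call.init);
//   return new Response(await upstream.text(), {
//     headers: { "Content-Type": "application/json" },
//   });
// });
```

Proxying server-side keeps provider endpoints and rate-limit handling out of the browser while still letting the caller's key pass through untouched.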

