The problem
Interview debriefs over-index on charisma, recency, or whichever interviewer speaks most confidently. The hardest dimensions — agency, judgment under ambiguity, real craft motivation — get evaluated least consistently.
What I built
Gabbar takes a transcript and outputs a structured scorecard: risk classification, scaling assessment (will this person grow with the environment or plateau), and explicit bias flags.
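A minimal sketch of what that scorecard could look like as a data structure. The field names, enum values, and evidence map below are illustrative assumptions, not Gabbar's actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum


class Risk(Enum):
    # Hypothetical risk buckets; the real classification may differ.
    LOW = "low"
    MODERATE = "moderate"
    HIGH = "high"


@dataclass
class Scorecard:
    """Structured output for one candidate transcript (illustrative fields)."""
    candidate_id: str
    risk: Risk                      # risk classification
    scaling_assessment: str         # grow-with-the-environment vs. plateau
    bias_flags: list[str] = field(default_factory=list)            # explicit bias flags
    evidence: dict[str, list[str]] = field(default_factory=dict)   # dimension -> transcript quotes
```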
The evaluation doctrine is seven layers deep, and each layer targets a specific dimension. The system forces the debrief conversation to address what actually matters instead of defaulting to gut feel.
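To make the layered pass concrete, here is a sketch of how a doctrine like this could be applied: each layer pairs a dimension with a rubric and is scored independently against the transcript. The layer names shown are examples drawn from the dimensions mentioned above; the full seven-layer doctrine and the `score_against_rubric` helper are assumptions, not the real implementation:

```python
from dataclasses import dataclass


@dataclass
class Layer:
    dimension: str   # what this layer evaluates
    rubric: str      # what counts as evidence for this dimension


# Example layers only; the actual doctrine runs seven layers deep.
DOCTRINE = [
    Layer("agency", "Did the candidate drive outcomes without being pushed?"),
    Layer("judgment under ambiguity", "How did they decide with incomplete information?"),
    Layer("craft motivation", "Is there evidence of caring about quality beyond the task?"),
]


def score_against_rubric(transcript: str, rubric: str) -> dict:
    """Placeholder: in practice this would be an LLM call constrained to
    quote transcript evidence for whatever score it assigns."""
    return {"score": None, "evidence": []}


def evaluate(transcript: str) -> dict[str, dict]:
    """Run every layer against the transcript; each layer is scored on its own."""
    results = {}
    for layer in DOCTRINE:
        results[layer.dimension] = score_against_rubric(transcript, layer.rubric)
    return results
```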
Design decision
I built explicit bias safeguards into the evaluation loop. The system flags when a score might be driven by pattern-matching on demographics rather than by evidence in the transcript. Most evaluation tools skip this entirely.
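One way to express that safeguard: before a score is accepted, check that it is backed by cited evidence and that its justification does not lean on demographic or pedigree proxies. The proxy-term list and checks below are illustrative assumptions, not Gabbar's actual logic:

```python
# Illustrative proxy terms; a real safeguard would need a far more careful
# notion of what counts as a demographic or pedigree signal.
PROXY_SIGNALS = ("school", "accent", "age", "looks like", "culture fit", "pedigree")


def bias_flags(justification: str, evidence_quotes: list[str]) -> list[str]:
    """Return human-readable flags when a score looks pattern-matched rather than evidence-based."""
    flags = []
    lowered = justification.lower()
    if any(term in lowered for term in PROXY_SIGNALS):
        flags.append("justification references a demographic or pedigree proxy")
    if not evidence_quotes:
        flags.append("score assigned with no transcript evidence cited")
    return flags
```

In this sketch the flags travel with the scorecard rather than blocking it, so the debrief sees the warning alongside the score instead of having the judgment silently overridden.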