Most knowledge tests score one thing: whether you got the right answer. This is fine for measuring what you know — factual recall, comprehension of a text, ability to apply a formula. But it misses almost everything relevant about how you think.
Consider two people who give the same wrong answer to a question. One wrote a careful, evidence-based rationale that almost arrived at the correct conclusion, made one plausible inferential error, and flagged their own uncertainty appropriately. The other guessed, wrote "I'm not sure but I think X," and moved on. By any binary scoring system, they both got zero. But they're not the same.
This is the problem MindFrame's AI reasoning scoring is designed to solve.
What reasoning scoring actually measures
When a user completes a MindFrame challenge, they see a question, write their reasoning in a free-text field, give an answer, and rate their confidence. The AI scoring component evaluates the reasoning text on a 0–10 scale across several dimensions:
- Logical structure: Does the reasoning build coherently toward a conclusion? Are the inferential steps valid?
- Evidence use: Does the reasoning engage with the actual evidence or principles relevant to the question, or does it reason by feel?
- Uncertainty acknowledgment: Does the reasoning appropriately flag limitations, alternative possibilities, or degrees of confidence?
- Bias detection: Does the reasoning show awareness of common biases that might apply, or does it exhibit them without acknowledgment?
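To make this concrete, here is a minimal sketch of how one evaluation result might be represented internally. The per-dimension sub-scores and field names are illustrative assumptions, not MindFrame's actual schema; the description above only specifies a 0–10 score plus brief feedback.

```python
from dataclasses import dataclass

@dataclass
class ReasoningEvaluation:
    """One scored reasoning response (illustrative field names, not the real schema)."""
    logical_structure: int           # 0-10: do the inferential steps build validly toward the conclusion?
    evidence_use: int                # 0-10: does the reasoning engage with the relevant evidence or principles?
    uncertainty_acknowledgment: int  # 0-10: are limitations and alternatives flagged appropriately?
    bias_detection: int              # 0-10: is there awareness of biases that might apply here?
    overall: int                     # 0-10: the headline reasoning-quality score
    feedback: str                    # brief natural-language feedback shown to the user
```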
Crucially, this score is calculated independently of whether the final answer is correct. A person can score 9/10 on reasoning quality and still get the question wrong (good process, bad outcome — common in genuinely hard questions). A person can score 2/10 on reasoning and guess correctly (bad process, lucky outcome).
Why independent scoring matters
The separation of reasoning score from outcome score is not a technicality — it's the entire point. Here's why:
If you only reward correct answers, you train people to optimize for getting the right answer by any means necessary — including guessing, pattern-matching to surface features, or memorizing templates. These strategies look good in the short run but don't transfer to novel situations.
If you score reasoning quality independently, you create an incentive to develop genuine thinking processes — the kind that transfer. A user who consistently scores 8/10 on reasoning but only 60% on accuracy is on a much better learning trajectory than someone who hits 90% accuracy with 3/10 reasoning quality, because the latter may be relying on domain pattern-matching that will fail in new contexts.
How we implemented it with Claude
We use Anthropic's Claude model to evaluate reasoning text. The prompt system gives Claude the question, the correct answer, the reasoning rubric, and the user's response. Claude returns a structured evaluation with a score and brief feedback on specific dimensions.
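A rough sketch of what that call could look like with the Anthropic Python SDK is below. The model id, prompt wording, and JSON response shape are assumptions for illustration; only the overall flow (question, correct answer, rubric, and user response in; structured score and feedback out) comes from the description above.

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def score_reasoning(question: str, correct_answer: str, rubric: str, reasoning: str) -> dict:
    """Ask Claude to evaluate reasoning quality and return a structured result.

    The prompt wording and response schema here are illustrative, not the
    production prompts.
    """
    system = (
        "You are evaluating the quality of a user's reasoning, NOT whether their "
        "conclusion is correct. Score 0-10 against the rubric and respond with JSON "
        'containing the keys "score" (int) and "feedback" (short string).'
    )
    user = (
        f"Question:\n{question}\n\n"
        f"Correct answer (context only; do not reward agreement with it):\n{correct_answer}\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"User's reasoning:\n{reasoning}"
    )
    message = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model id
        max_tokens=500,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    # Sketch-level parsing; a production system would validate the output more strictly.
    return json.loads(message.content[0].text)
```

In practice you would want stricter output validation before trusting the parse, but the sketch shows the shape of the exchange.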
Several design decisions were important:
Separating evaluation from answer judgment. The prompt explicitly instructs Claude to evaluate reasoning quality independently of whether the conclusion is correct. This required careful prompt engineering: by default, LLMs tend to rate reasoning that arrives at the correct conclusion more favorably, even when the underlying process is no better.
Calibrating for length bias. Longer reasoning texts tend to score higher even when they're not better. We built explicit length-normalization into the prompt to ensure a concise, incisive 2-sentence response can score as highly as a thoughtful paragraph.
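Both of these decisions ultimately live in the prompt text. The fragments below are an assumed wording that captures the intent; they are not MindFrame's production prompts.

```python
# Illustrative prompt fragments (assumed wording) for the two calibration issues above.

INDEPENDENCE_INSTRUCTION = (
    "Evaluate only the quality of the reasoning process. A response that reaches the "
    "wrong conclusion through sound, evidence-based steps should score higher than one "
    "that reaches the right conclusion by guessing or surface pattern-matching."
)

LENGTH_NORMALIZATION_INSTRUCTION = (
    "Do not reward length. Judge the density of valid inference, not word count: a "
    "concise two-sentence response that makes its key moves explicit can earn the same "
    "score as a long paragraph."
)
```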
Caching for cost control. AI scoring is run once per challenge completion and cached — the score for a given reasoning text never needs to be recomputed. Combined with Redis caching of identical prompts, this keeps AI inference costs manageable at scale.
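A minimal sketch of the identical-prompt cache, assuming redis-py and a SHA-256 key over the prompt inputs; the key scheme, TTL, and function names are illustrative.

```python
import hashlib
import json
import redis

cache = redis.Redis()  # assumes a local Redis instance; connection details are illustrative

CACHE_TTL_SECONDS = 60 * 60 * 24 * 30  # assumed retention window


def cached_score(question: str, correct_answer: str, rubric: str, reasoning: str) -> dict:
    """Return a cached evaluation if this exact prompt has been scored before,
    otherwise call the model once and cache the result."""
    key = "reasoning-score:" + hashlib.sha256(
        json.dumps([question, correct_answer, rubric, reasoning]).encode()
    ).hexdigest()

    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)

    result = score_reasoning(question, correct_answer, rubric, reasoning)  # from the earlier sketch
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(result))
    return result
```

Keying on a hash of the full prompt inputs means a re-submitted identical response never triggers a second inference call, which is where the cost savings come from.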
What users see
After each session, users receive a three-component score: outcome accuracy (how often were you right), reasoning quality (how well did you think, per the AI evaluation), and calibration (how well did your stated confidence match your accuracy). A composite Precision Score combines all three.
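As a sketch of how such a composite might be computed, assuming each component is normalized to [0, 1]; the calibration measure and the weights are illustrative assumptions, not MindFrame's actual formula.

```python
def calibration_component(confidences: list[float], correct: list[bool]) -> float:
    """One simple calibration measure (an assumption, not necessarily the production metric):
    1 minus the mean absolute gap between stated confidence and actual correctness."""
    gaps = [abs(c - (1.0 if ok else 0.0)) for c, ok in zip(confidences, correct)]
    return 1.0 - sum(gaps) / len(gaps)


def precision_score(accuracy: float, reasoning_quality: float, calibration: float,
                    weights: tuple[float, float, float] = (0.4, 0.4, 0.2)) -> float:
    """Combine the three components into a 0-100 composite. The weights are illustrative."""
    w_acc, w_rsn, w_cal = weights
    return 100.0 * (w_acc * accuracy + w_rsn * reasoning_quality + w_cal * calibration)
```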
The reasoning score includes brief natural-language feedback — not just a number, but a short description of what the reasoning did well and where it fell short. This feedback is the instructional signal that drives learning; the number alone isn't enough.
What this enables that binary scoring doesn't
With three independent scores, MindFrame can identify patterns that binary scoring misses: users who are accurate but overconfident (calibration gap), users who reason well but reach wrong conclusions (knowledge gaps rather than reasoning gaps), users who guess correctly but can't explain why (brittle knowledge).
Each pattern maps to different interventions. A calibration gap suggests confidence-targeted practice. A knowledge gap suggests content focus rather than process work. Brittle knowledge suggests challenge diversity rather than repetition in a comfortable domain.
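As an illustration of how those patterns could be read off the three scores, here is a toy classifier; the thresholds and recommendation text are assumptions, not the product's actual rules.

```python
def diagnose(accuracy: float, reasoning: float, calibration: float) -> str:
    """Map a user's three component scores (each normalized to 0-1) to a practice
    recommendation. Thresholds and labels are illustrative assumptions."""
    if accuracy >= 0.75 and calibration < 0.6:
        return "calibration gap: practice confidence-targeted rating on familiar material"
    if reasoning >= 0.7 and accuracy < 0.6:
        return "knowledge gap: focus on content rather than process work"
    if accuracy >= 0.7 and reasoning < 0.4:
        return "brittle knowledge: diversify challenge domains instead of repeating comfortable ones"
    return "balanced profile: continue mixed practice"
```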
The reasoning score is what makes this diagnostic possible. Without it, you have a one-dimensional view of cognitive performance. With it, you have a map.