LLM-Judge Prompt Benchmark

Empirical comparison of naive 1-10 scoring vs the research-backed Likert + CoT + bias-mitigation prompt on a public dataset

What this measures

The LLM-as-Judge Validation page argues, citing the HuggingFace LLM Judge cookbook and several research papers, that an integer Likert prompt with explicit Chain-of-Thought ordering and anti-bias guidance produces more reliable judgements than a naive "score this 1-10" prompt. This benchmark turns that claim into a number.

A self-contained research script (demo_judgebench_comparison.py) runs the same model under two prompt variants on a public dataset, compares each predicted winner against ground truth, and reports which design wins. JudgeBench is the default dataset because it is open, peer-reviewed, and small enough to run cheaply.

Methodology


Dataset	`ScalerLab/JudgeBench` — 350 challenging response pairs with ground-truth A>B / B>A labels, MIT licence (arXiv 2410.12784)
Sample size	50 rows (seed=42) from the `gpt` split
Judge model	`gemini-3-flash-preview` via Vertex AI, thinking budget 0
Prompt A — naive	"Score response from 1-10. Pick the winner." No CoT, no bias guidance
Prompt B — research	Likert 1-5, evaluate-before-scoring CoT, anti-verbosity / anti-formatting / anti-authority guidance
Position-bias mitigation	swap-and-average for both prompts (each pair judged twice with A/B reversed; tie when passes disagree)
Metrics	Accuracy vs ground truth, position-bias flip rate, ties, mean cost (USD), mean latency (s), score distribution

The harness is dataset-agnostic; you can pass any (question, response_a, response_b, label) source into the in-script JudgeBenchmarkHarness.run() method.
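The swap-and-average rule from the table above can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: the function name `combine_swapped_passes` is hypothetical, and counting any disagreement between the two passes as a position-bias flip is an assumption.

```python
from typing import Literal

Verdict = Literal["A>B", "B>A", "tie"]

# Translate a verdict given in the swapped presentation order
# back into the original A/B labelling.
_UNSWAP = {"A>B": "B>A", "B>A": "A>B", "tie": "tie"}


def combine_swapped_passes(
    first_pass: Verdict, second_pass_swapped: Verdict
) -> tuple[Verdict, bool]:
    """Combine two judgements of the same pair (hypothetical helper).

    first_pass: verdict with the responses in their original order.
    second_pass_swapped: verdict from the pass with A and B reversed,
        still expressed in the swapped labelling.
    Returns (final_verdict, flipped).
    """
    second_pass = _UNSWAP[second_pass_swapped]
    if first_pass == second_pass:
        # Both orderings agree, so the verdict is kept.
        return first_pass, False
    # The passes disagree: the pair is scored as a tie and
    # (by assumption) counted as a position-bias flip.
    return "tie", True


# The judge prefers whichever response is shown first: a flip.
print(combine_swapped_passes("A>B", "A>B"))
# The judge prefers the same underlying response both times.
print(combine_swapped_passes("A>B", "B>A"))
```

Under this rule a judge that always favours the first-listed response never gets credit for a correct answer, which is why the flip rate is reported alongside accuracy.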

Results

Numbers below come from implementation_layer/examples/software_components/validators/judgebench-comparison/leaderboard.md in the gaik-toolkit repository, run with seed 42 on the gpt split. Re-run the demo against the same seed to reproduce.

Prompt	Accuracy	Position-bias flips	Ties	Mean cost (USD)	Mean latency (s)
`naive`	64.0%	26.0%	13	$0.0017	13.85
`research`	66.0%	22.0%	14	$0.0023	16.45

The research-backed prompt comes out ahead on the two metrics that matter most for production use:

+2 percentage points of accuracy against the held-out ground-truth labels (33 vs. 32 correct pairs out of 50): small at N=50 but in the expected direction.
-4 percentage points of position-bias flips: 22 % of pairs (11 of 50) change their predicted winner under the swap-and-average check, vs. 26 % (13 of 50) for the naive prompt. Position bias is the failure mode that causes silent inconsistency in production, so this is the more important number.

The trade-off is +35 % cost and +19 % latency, because the research-backed prompt asks for a reason field before the score (the "evaluate first, judge second" CoT ordering). On Gemini Flash this still costs about $0.0023 per pair, under a quarter of a cent.
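The deltas quoted in this section follow directly from the results table. As a quick check of the arithmetic, the sketch below recomputes them; the dictionary layout and variable names are illustrative only and do not mirror the script's summary.json.

```python
# Values copied from the results table (N = 50 pairs).
N_PAIRS = 50
naive = {"accuracy": 0.64, "flip_rate": 0.26, "cost_usd": 0.0017, "latency_s": 13.85}
research = {"accuracy": 0.66, "flip_rate": 0.22, "cost_usd": 0.0023, "latency_s": 16.45}


def relative_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, e.g. 0.35 for +35 %."""
    return (new - old) / old


cost_overhead = relative_change(research["cost_usd"], naive["cost_usd"])
latency_overhead = relative_change(research["latency_s"], naive["latency_s"])

# Percentage-point differences converted back to whole pairs.
extra_correct_pairs = round((research["accuracy"] - naive["accuracy"]) * N_PAIRS)
fewer_flipped_pairs = round((naive["flip_rate"] - research["flip_rate"]) * N_PAIRS)

print(f"cost overhead:    {cost_overhead:+.0%}")
print(f"latency overhead: {latency_overhead:+.0%}")
print(f"extra correct pairs: {extra_correct_pairs}")
print(f"fewer flipped pairs: {fewer_flipped_pairs}")
```

Converting the percentage points back to pair counts makes the sample-size caveat concrete: the accuracy gap is a single pair and the flip-rate gap is two pairs.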

Score distribution — the qualitative story

The score histograms are the clearest artefact for a reader: they show qualitatively why a naive 1-10 prompt is hard to use as a metric.

naive (1-10 scale, 100 ratings = 50 pairs × 2 sides): 2 × 9, 3 × 4, 4 × 14, 5 × 6, 6 × 15, 7 × 3, 8 × 9, 9 × 14, 10 × 26
research (Likert 1-5): 1 × 3, 2 × 36, 3 × 7, 4 × 21, 5 × 33

The naive prompt puts 26 % of all ratings on the maximum value 10 and another 14 % on 9, so two in five responses get a 9 or 10, and no response gets a 1. The remaining ratings scatter unevenly over 2-8, with 3, 5, and 7 rarely used. This is the classic "LLMs oversaturate at the high end of continuous scales" pattern from the HuggingFace cookbook.

The research-backed prompt with integer Likert 1-5 uses every value on the scale and concentrates ratings on 2 (significant error, 36 %) and 5 (perfect match, 33 %), two levels with a clear semantic meaning. With fewer values to spread across, the model is forced to discriminate instead of giving compromise scores.

If you ever plan to track mean_score as a regression KPI across deploys, this is the difference between "the metric moves with quality" (Likert) and "the metric is compressed toward the top of the scale" (1-10).
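To read the histograms as numbers, the sketch below recomputes a few summary statistics from the counts listed above. The helper names `mean_score` and `share` are illustrative and are not part of the toolkit.

```python
# Score -> number of ratings, copied from the histograms above.
naive_hist = {2: 9, 3: 4, 4: 14, 5: 6, 6: 15, 7: 3, 8: 9, 9: 14, 10: 26}
research_hist = {1: 3, 2: 36, 3: 7, 4: 21, 5: 33}


def mean_score(hist: dict[int, int]) -> float:
    """Count-weighted mean of the scores in a histogram."""
    total = sum(hist.values())
    return sum(score * count for score, count in hist.items()) / total


def share(hist: dict[int, int], scores: tuple[int, ...]) -> float:
    """Fraction of all ratings that fall on the given score values."""
    total = sum(hist.values())
    return sum(hist.get(score, 0) for score in scores) / total


print("naive ratings:", sum(naive_hist.values()))
print("naive mean:", mean_score(naive_hist))
print("naive share on 9-10:", share(naive_hist, (9, 10)))
print("research ratings:", sum(research_hist.values()))
print("research mean:", mean_score(research_hist))
print("research share on 2 and 5:", share(research_hist, (2, 5)))
```

Both histograms sum to 100 ratings, so each count can be read directly as a percentage.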

Reproducing

The benchmark lives as a single self-contained script under implementation_layer/examples/software_components/validators/demo_judgebench_comparison.py. It is intentionally not part of the toolkit's public API — the production-facing validator surface (LLMJudge, LLMJudgePanel, compare_pairwise, calibration) stays focused on validation, while this comparison harness lives next to its docs and stays out of the core install.

cd gaik-toolkit
pip install -e ".[evaluators,llm-judge]"
pip install "datasets>=2.14"   # script-level dep; not a toolkit extra

# Set provider env vars (Vertex shown; Azure / Anthropic / OpenAI also supported)
export GOOGLE_VERTEXAI_PROJECT=<your-project>
export GOOGLE_VERTEXAI_LOCATION=global
export GOOGLE_APPLICATION_CREDENTIALS=<service-account.json>

# Quick smoke (10 rows, ~$0.05)
python implementation_layer/examples/software_components/validators/demo_judgebench_comparison.py --n 10

# Full run (50 rows, seed 42, writes leaderboard.md + summary.json)
python implementation_layer/examples/software_components/validators/demo_judgebench_comparison.py \
    --n 50 \
    --provider google \
    --out implementation_layer/examples/software_components/validators/judgebench-comparison/

Plugging in your own data

The HF loader is a convenience for the public dataset. To run on your own pairwise data, copy the script and replace the load_judgebench call with a list of plain dicts:

# In a copy of demo_judgebench_comparison.py
rows = [
    {
        "pair_id": "case-1",
        "question": "What is the boiling point of water at sea level?",
        "response_a": "100 °C (212 °F).",
        "response_b": "Around 95 °C in most conditions.",
        "label": "A>B",
    },
    # ... more rows
]

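# Provider and model match the benchmark run; swap in your own if needed.
bench = JudgeBenchmarkHarness(provider="google", model="gemini-3-flash-preview")
# swap_and_average=True judges each pair twice with A/B reversed.
report = bench.run(rows, prompts=("naive", "research"), swap_and_average=True)

# One summary entry per prompt variant.
for s in report.prompts:
    print(s.prompt, s.accuracy, s.position_bias_rate, s.mean_cost_usd)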

Use this to run the same A/B comparison on your own pairwise data without touching the toolkit's source. If the same prompt-design advantage shows up on your data too, you have empirical evidence that the research-backed defaults are worth keeping.
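Before spending API calls on your own rows, it can help to check that every dict has the expected shape. The helper below is a hypothetical pre-flight check, not part of the script; it assumes the five keys shown in the example above are all required and that labels are limited to "A>B" and "B>A".

```python
# Assumed schema, taken from the example row above.
REQUIRED_KEYS = ("pair_id", "question", "response_a", "response_b", "label")
VALID_LABELS = {"A>B", "B>A"}


def validate_rows(rows: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the rows look usable."""
    problems: list[str] = []
    seen_ids: set[str] = set()
    for index, row in enumerate(rows):
        # Keys that are absent or hold an empty value.
        missing = [key for key in REQUIRED_KEYS if not row.get(key)]
        if missing:
            problems.append(f"row {index}: missing or empty {missing}")

        label = row.get("label")
        if label and label not in VALID_LABELS:
            problems.append(f"row {index}: unexpected label {label!r}")

        pair_id = row.get("pair_id")
        if pair_id:
            if pair_id in seen_ids:
                problems.append(f"row {index}: duplicate pair_id {pair_id!r}")
            else:
                seen_ids.add(pair_id)
    return problems


example_rows = [
    {
        "pair_id": "case-1",
        "question": "What is the boiling point of water at sea level?",
        "response_a": "100 °C (212 °F).",
        "response_b": "Around 95 °C in most conditions.",
        "label": "A>B",
    },
]
print(validate_rows(example_rows))
```

Calling `validate_rows(rows)` before `bench.run(...)` and stopping on a non-empty result keeps malformed rows from skewing the accuracy numbers.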

Caveats

N=50 is statistically small, so confidence intervals on accuracy are wide. The benchmark is designed to demonstrate methodology and give a directional signal, not to claim significance. Scale up the --n flag if you need narrower bounds.
JudgeBench is general (knowledge / reasoning / math / coding tasks). If you are evaluating a domain-specific judge, e.g. for document extraction, pair this benchmark with a calibration run on your own labelled data (calibration utility).
The prompt-design effect can be model-dependent. This run uses Gemini 3 Flash; the same prompt pair may give different deltas on Claude or GPT. Add --provider anthropic / --provider openai to extend the comparison.
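To put the sample-size caveat in numbers, the sketch below computes a 95 % Wilson score interval for each accuracy figure. The Wilson interval is one reasonable choice for a binomial proportion at this sample size; it is not something the benchmark script reports, and the function name is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95 %)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# 64 % and 66 % accuracy on 50 pairs correspond to 32 and 33 correct pairs.
naive_low, naive_high = wilson_interval(32, 50)
research_low, research_high = wilson_interval(33, 50)

print(f"naive:    {naive_low:.3f} to {naive_high:.3f}")
print(f"research: {research_low:.3f} to {research_high:.3f}")
```

Each interval is roughly 25 percentage points wide and the two overlap almost entirely, which is why the accuracy difference should be read as a directional signal only.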

LLM-as-Judge Validation — what the research-backed prompt looks like, why every component is there
Calibration against human labels — measure judge vs human agreement on your own data
Pairwise comparison utility — vision-based pairwise judging for extraction outputs

References

HuggingFace LLM Judge cookbook — Likert vs continuous scales (~30 % human-correlation gain)
JudgeBench paper (arXiv 2410.12784) — the dataset and methodology
Justice or Prejudice (OpenReview 2024) — position-bias swap mitigation
Quantitative LLM Judges (arXiv 2506.02945) — calibration adjustments at scale
Mitigating Bias of LLM Evaluation (arXiv 2409.16788) — anti-bias prompt construction
