LLM-Judge Prompt Benchmark
Empirical comparison of naive 1-10 scoring vs the research-backed Likert + CoT + bias-mitigation prompt on a public dataset
What this measures
The LLM-as-Judge Validation page argues, citing the HuggingFace LLM Judge cookbook and several research papers, that an integer Likert prompt with explicit Chain-of-Thought ordering and anti-bias guidance produces more reliable judgements than a naive "score this 1-10" prompt. This benchmark turns that claim into a number.
A self-contained research script
(demo_judgebench_comparison.py)
runs the same model under two prompt variants on a public dataset,
compares the predicted winner against ground truth, and reports which
design wins. The harness is dataset-agnostic — JudgeBench is the default
because it is open, peer-reviewed, and small enough to run cheaply.
Methodology
| Setting | Value |
|---|---|
| Dataset | ScalerLab/JudgeBench: 350 challenging response pairs with ground-truth A>B / B>A labels, MIT licence (arXiv 2410.12784) |
| Sample size | 50 rows (seed=42) from the gpt split |
| Judge model | gemini-3-flash-preview via Vertex AI, thinking budget 0 |
| Prompt A (naive) | "Score response from 1-10. Pick the winner." No CoT, no bias guidance |
| Prompt B (research) | Likert 1-5, evaluate-before-scoring CoT, anti-verbosity / anti-formatting / anti-authority guidance |
| Position-bias mitigation | swap-and-average for both prompts (each pair judged twice with A/B reversed; tie when passes disagree) |
| Metrics | Accuracy vs ground truth, position-bias flip rate, ties, mean cost (USD), mean latency (s), score distribution |
Any source of (question, response_a, response_b, label) rows can be passed to the in-script
JudgeBenchmarkHarness.run() method.
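The position-bias mitigation row above can be sketched in a few lines. This is a minimal illustration, not the script's actual implementation: the `Judge` signature, `judge_with_swap`, and the two toy judges are hypothetical names, and it assumes a pass that ends in a tie while the other picks a winner also counts as a flip.

```python
from typing import Callable, Tuple

# Hypothetical judge signature: (question, first, second) -> (score_first, score_second).
Judge = Callable[[str, str, str], Tuple[float, float]]


def winner(score_a: float, score_b: float) -> str:
    """Map a pair of scores to 'A', 'B', or 'tie'."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


def judge_with_swap(judge: Judge, question: str, resp_a: str, resp_b: str) -> Tuple[str, bool]:
    """Judge a pair twice with the order reversed.

    Returns (verdict, flipped). The verdict is 'tie' when the passes disagree.
    """
    # Pass 1: original order.
    a1, b1 = judge(question, resp_a, resp_b)
    # Pass 2: reversed order. The judge sees B first, so unpack accordingly.
    b2, a2 = judge(question, resp_b, resp_a)

    first, second = winner(a1, b1), winner(a2, b2)
    flipped = first != second
    return ("tie" if flipped else first), flipped


def first_slot_fan(question: str, first: str, second: str) -> Tuple[float, float]:
    """Toy judge with pure position bias: always prefers the first slot."""
    return 5.0, 3.0


def length_judge(question: str, first: str, second: str) -> Tuple[float, float]:
    """Toy order-independent judge: the longer response scores higher."""
    return float(len(first)), float(len(second))


biased = judge_with_swap(first_slot_fan, "q", "x", "y")
stable = judge_with_swap(length_judge, "q", "long answer", "short")
print(biased)  # ('tie', True)
print(stable)  # ('A', False)
```

A judge that only reacts to position is caught by the reversed pass and reported as a tie, while an order-independent judge keeps its verdict. The share of pairs with `flipped == True` corresponds to the position-bias flip rate listed under Metrics.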
Results
Numbers below come from
implementation_layer/examples/software_components/validators/judgebench-comparison/leaderboard.md
in the gaik-toolkit repository, run with seed 42 on the gpt split.
Re-run the demo against the same seed to reproduce.
| Prompt | Accuracy | Position-bias flips | Ties | Mean cost (USD) | Mean latency (s) |
|---|---|---|---|---|---|
| naive | 64.0% | 26.0% | 13 | $0.0017 | 13.85 |
| research | 66.0% | 22.0% | 14 | $0.0023 | 16.45 |
The research-backed prompt wins on the two metrics that actually matter for production use:
- +2 percentage points of accuracy against the held-out ground-truth labels. On N=50 that is a single pair, so the gain is small but in the expected direction.
- -4 percentage points of position-bias flips: 22 % of pairs change their predicted winner under the swap-and-average check, vs. 26 % for the naive prompt. Position-bias is the failure mode that causes silent inconsistency in production, so this is the more important number.
The trade-off is +35 % cost and +19 % latency because the
research-backed prompt asks for a reason field before the score (the
"evaluate first, judge second" CoT ordering). On Gemini Flash this still
costs about a fifth of a cent per pair.
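The deltas quoted in this section follow directly from the leaderboard table. The snippet below is only a worked check of that arithmetic; the dictionary keys and helper names are made up for illustration and are not fields of the script's report.

```python
# Values copied from the results table (seed 42, gpt split, 50 pairs).
naive = {"accuracy": 0.64, "flip_rate": 0.26, "cost_usd": 0.0017, "latency_s": 13.85}
research = {"accuracy": 0.66, "flip_rate": 0.22, "cost_usd": 0.0023, "latency_s": 16.45}
N_PAIRS = 50


def pp_delta(key: str) -> float:
    """Absolute difference in percentage points (research minus naive)."""
    return round((research[key] - naive[key]) * 100, 1)


def rel_delta(key: str) -> float:
    """Relative change in percent (research vs naive)."""
    return round((research[key] / naive[key] - 1) * 100, 1)


accuracy_pp = pp_delta("accuracy")
flip_pp = pp_delta("flip_rate")
cost_pct = rel_delta("cost_usd")
latency_pct = rel_delta("latency_s")
# How many of the 50 pairs the accuracy gain corresponds to.
pairs_gained = round((research["accuracy"] - naive["accuracy"]) * N_PAIRS)

print(f"accuracy: {accuracy_pp:+.1f} pp ({pairs_gained} pair)")  # accuracy: +2.0 pp (1 pair)
print(f"flips:    {flip_pp:+.1f} pp")                            # flips:    -4.0 pp
print(f"cost:     {cost_pct:+.0f} %")                            # cost:     +35 %
print(f"latency:  {latency_pct:+.0f} %")                         # latency:  +19 %
```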
Score distribution — the qualitative story
The score histograms are the clearest artefact for a reader: they show qualitatively why a naive 1-10 prompt is hard to use as a metric.
- naive (1-10 scale, 100 ratings = 50 pairs × 2 sides), as score × count: 2 × 9, 3 × 4, 4 × 14, 5 × 6, 6 × 15, 7 × 3, 8 × 9, 9 × 14, 10 × 26
- research (Likert 1-5), as score × count: 1 × 3, 2 × 36, 3 × 7, 4 × 21, 5 × 33
The naive prompt piles 26 % of all ratings on the maximum value 10 and another 14 % on 9, so 40 % of the responses get a 9 or 10. The rest of the scale is used unevenly: 4 and 6 each draw about 15 % of the ratings, 3, 5 and 7 are rarely used, and 1 never appears. This is the classic "LLMs oversaturate at the high end of continuous scales" pattern from the HuggingFace cookbook.
The research-backed prompt with integer Likert 1-5 spreads ratings across the entire scale and concentrates them on 2 (significant error) and 5 (perfect match) — the two semantically meaningful endpoints. There are fewer values to spread across, so the model is forced to discriminate instead of giving compromise scores.
If you plan to track mean_score as a regression KPI across
deploys, this is the difference between "the metric moves with quality"
(Likert) and "the metric is dominated by a ceiling of 9s and 10s" (1-10).
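To make the histogram reading concrete, the counts above can be summarised with a few lines of Python. This is a sketch for checking the shares by hand; the helper names are illustrative and not part of the benchmark script.

```python
# Score -> count, copied from the histograms above (100 ratings each).
naive_hist = {2: 9, 3: 4, 4: 14, 5: 6, 6: 15, 7: 3, 8: 9, 9: 14, 10: 26}
research_hist = {1: 3, 2: 36, 3: 7, 4: 21, 5: 33}


def total(hist: dict) -> int:
    """Number of ratings in the histogram."""
    return sum(hist.values())


def mean_score(hist: dict) -> float:
    """Count-weighted mean of the scores."""
    return sum(score * count for score, count in hist.items()) / total(hist)


def share(hist: dict, scores) -> float:
    """Fraction of ratings that fall on the given score values."""
    return sum(hist.get(s, 0) for s in scores) / total(hist)


naive_top = share(naive_hist, [9, 10])
research_anchor = share(research_hist, [2, 5])

print(total(naive_hist), total(research_hist))  # 100 100
print(f"naive 9-10 share: {naive_top:.0%}")     # naive 9-10 share: 40%
print(f"research 2+5 share: {research_anchor:.0%}")  # research 2+5 share: 69%
print(mean_score(naive_hist), mean_score(research_hist))  # 6.85 3.45
```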
Reproducing
The benchmark lives as a single self-contained script under
implementation_layer/examples/software_components/validators/demo_judgebench_comparison.py.
It is intentionally not part of the toolkit's public API — the
production-facing validator surface (LLMJudge, LLMJudgePanel,
compare_pairwise, calibration) stays focused on validation, while
this comparison harness lives next to its docs and stays out of the
core install.

    cd gaik-toolkit
    pip install -e ".[evaluators,llm-judge]"
    pip install "datasets>=2.14"  # script-level dep; not a toolkit extra

    # Set provider env vars (Vertex shown; Azure / Anthropic / OpenAI also supported)
    export GOOGLE_VERTEXAI_PROJECT=<your-project>
    export GOOGLE_VERTEXAI_LOCATION=global
    export GOOGLE_APPLICATION_CREDENTIALS=<service-account.json>

    # Quick smoke (10 rows, ~$0.05)
    python implementation_layer/examples/software_components/validators/demo_judgebench_comparison.py --n 10

    # Full run (50 rows, seed 42, writes leaderboard.md + summary.json)
    python implementation_layer/examples/software_components/validators/demo_judgebench_comparison.py \
        --n 50 \
        --provider google \
        --out implementation_layer/examples/software_components/validators/judgebench-comparison/

Plugging in your own data
The HF loader is a convenience for the public dataset. To run on your own
pairwise data, copy the script and replace the load_judgebench call
with a list of plain dicts:

    # In a copy of demo_judgebench_comparison.py
    rows = [
        {
            "pair_id": "case-1",
            "question": "What is the boiling point of water at sea level?",
            "response_a": "100 °C (212 °F).",
            "response_b": "Around 95 °C in most conditions.",
            "label": "A>B",
        },
        # ... more rows
    ]

    bench = JudgeBenchmarkHarness(provider="google", model="gemini-3-flash-preview")
    report = bench.run(rows, prompts=("naive", "research"), swap_and_average=True)
    for s in report.prompts:
        print(s.prompt, s.accuracy, s.position_bias_rate, s.mean_cost_usd)

Use this to run the same A/B comparison on your own pairwise data without touching the toolkit's source. If the same prompt-design advantage shows up on your data too, you have empirical evidence that the research-backed defaults are worth keeping.
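If your pairs live in a file rather than in code, one way to build the list of dicts is to read them from CSV with the standard library. The loader below is a sketch under assumptions: `load_rows` is a hypothetical helper, and it assumes the CSV column names match the dict keys used above and that labels are either A>B or B>A.

```python
import csv
import io

# Keys taken from the example row above; CSV columns are assumed to match them.
REQUIRED = ("pair_id", "question", "response_a", "response_b", "label")
VALID_LABELS = {"A>B", "B>A"}


def load_rows(fileobj):
    """Read pairwise rows from an open CSV file and validate them."""
    rows = []
    # Data starts on line 2 because line 1 is the header.
    for line_no, rec in enumerate(csv.DictReader(fileobj), start=2):
        clean = {k: (rec.get(k) or "").strip() for k in REQUIRED}
        missing = [k for k, v in clean.items() if not v]
        if missing:
            raise ValueError(f"line {line_no}: missing {missing}")
        if clean["label"] not in VALID_LABELS:
            raise ValueError(f"line {line_no}: bad label {clean['label']!r}")
        rows.append(clean)
    return rows


# In-memory CSV so the sketch runs without a file on disk.
sample = io.StringIO(
    "pair_id,question,response_a,response_b,label\n"
    "case-1,What is the boiling point of water at sea level?,"
    "100 °C (212 °F).,Around 95 °C in most conditions.,A>B\n"
)
rows = load_rows(sample)
print(len(rows), rows[0]["label"])  # 1 A>B
```

Validating labels up front keeps a typo such as `A<B` from being silently scored as a wrong prediction later in the run. With a real file, pass `open(path, newline="", encoding="utf-8")` instead of the `StringIO` object.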
Caveats
- N=50 is small statistically: confidence intervals on accuracy are wide. The benchmark is designed to demonstrate methodology and give a directional signal, not to claim significance. Scale up the --n flag if you need narrower bounds.
- JudgeBench is general (knowledge / reasoning / math / coding tasks). If you are evaluating a domain-specific judge, e.g. for document extraction, pair this benchmark with a calibration run on your own labelled data (calibration utility).
- Prompt-design effect can be model-dependent. This run uses Gemini 3 Flash; the same prompt pair may give different deltas on Claude or GPT. Add --provider anthropic / --provider openai to extend the comparison.
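To put a number on "wide", the sketch below computes a 95 % Wilson score interval for each accuracy figure. The Wilson interval is a choice made here for illustration, not something the benchmark script is known to report, and the correct counts (32 and 33 out of 50) are derived from the 64.0 % and 66.0 % in the results table.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95 %)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


naive_ci = wilson_interval(32, 50)     # 64.0 % accuracy
research_ci = wilson_interval(33, 50)  # 66.0 % accuracy

for name, (lo, hi) in (("naive", naive_ci), ("research", research_ci)):
    print(f"{name}: {lo:.1%} to {hi:.1%}")

# The intervals overlap when each lower bound is below the other's upper bound.
overlap = naive_ci[0] < research_ci[1] and research_ci[0] < naive_ci[1]
print("intervals overlap:", overlap)
```

At N=50 both intervals span roughly 25 percentage points and overlap almost entirely, which is why the 2-point accuracy gap should be read as directional only.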
Related docs
- LLM-as-Judge Validation — what the research-backed prompt looks like, why every component is there
- Calibration against human labels — measure judge vs human agreement on your own data
- Pairwise comparison utility — vision-based pairwise judging for extraction outputs
References
- HuggingFace LLM Judge cookbook — Likert vs continuous scales (~30 % human-correlation gain)
- JudgeBench paper (arXiv 2410.12784) — the dataset and methodology
- Justice or Prejudice (OpenReview 2024) — position-bias swap mitigation
- Quantitative LLM Judges (arXiv 2506.02945) — calibration adjustments at scale
- Mitigating Bias of LLM Evaluation (arXiv 2409.16788) — anti-bias prompt construction