
EpiCal — a calibration benchmark for ancient-text prediction

EpiCal is the first leakage-free comparison of restoration and dating models for ancient Greek and Latin that reports not just accuracy but also calibration (ECE) and selective prediction (AURC), metrics that Ithaca, Aeneas, and Cullhed 2026 never publish. It asks the question a scholar actually needs answered: when the model is confident, is it right, and does it know when it doesn't know? Every number below is computed on records that no model in the comparison trained on, and each was independently re-derived.

How it is leakage-free

The clean split

Test items are I.PHI / LED records with id % 10 == 3. This is DeepMind's own held-out test split (from predictingthepast/train/dataloader.py), so Ithaca and Aeneas provably never trained on them. The items are also screened against the joint torso's reconstructed training set (and text-deduplicated across databases), so the final set is unseen by all three models.
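The selection rule can be sketched roughly as follows. This is a minimal illustration, not the benchmark's code: it assumes each record is a dict with an integer `id` and a string `text`, that ids share one namespace across databases, and the names `is_heldout`, `normalize`, and `clean_test_split` are hypothetical. The normalisation used for text-level deduplication is likewise an assumption.

```python
import unicodedata


def is_heldout(record_id: int) -> bool:
    """True for records in the id % 10 == 3 held-out split."""
    return record_id % 10 == 3


def normalize(text: str) -> str:
    """Hypothetical normalisation for text dedup: NFC, lowercase, collapsed whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).lower().split())


def clean_test_split(records, other_training_records):
    """Keep held-out records whose id and normalised text are both
    absent from another model's (reconstructed) training set."""
    seen_ids = {r["id"] for r in other_training_records}
    seen_texts = {normalize(r["text"]) for r in other_training_records}
    return [
        r
        for r in records
        if is_heldout(r["id"])
        and r["id"] not in seen_ids
        and normalize(r["text"]) not in seen_texts
    ]
```

Exact-match deduplication of this kind would not catch near-duplicate editions of the same inscription, which is consistent with the caveat stated under the limits.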

The task

T1 — restore a masked 3-character span in otherwise-intact text (closer to real loss than single-character masking). T2 — estimate the date (160 decade bins). Both models get the identical records; the commodity LLM gets the same text with a verbalized-confidence prompt.
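A sketch of how the two task inputs might be constructed is given here. The function names, the use of ▢ as the mask character, and the uniform choice of span position are assumptions for illustration. The lower bound of -800 for the date range is also an assumption (it is the range that gives 160 ten-year bins ending at year 800); the source states only the bin count.

```python
import random

MASK = "▢"  # placeholder for one missing character (assumed)


def mask_span(text: str, span_len: int = 3, rng=None):
    """Hide one contiguous span of `span_len` characters.

    Returns (masked_text, gold_span, start_index)."""
    rng = rng or random.Random()
    if len(text) < span_len:
        raise ValueError("text shorter than the span to mask")
    start = rng.randrange(len(text) - span_len + 1)
    gold = text[start:start + span_len]
    masked = text[:start] + MASK * span_len + text[start + span_len:]
    return masked, gold, start


def decade_bin(year: int, min_year: int = -800, num_bins: int = 160) -> int:
    """Map a signed year to a ten-year bin index in [0, num_bins)."""
    idx = (year - min_year) // 10
    if not 0 <= idx < num_bins:
        raise ValueError("year outside the binned range")
    return idx
```

Passing a seeded `random.Random` makes the masked positions reproducible, so every model can be evaluated on exactly the same gaps.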

Verified, not asserted

The forward pass is validated against fixtures (7.6e-6); the split is proven from the training code plus a SHA-identity check on the shipped weights; every headline metric is independently re-derived. See the validation page and the spec. The full reproducibility package (model card, dataset SHAs, splits, how-to-run) is in eval/REPRODUCIBILITY.md.
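The two checks named here could look something like the sketch that follows. It assumes the weights are a single file hashed with SHA-256, and that the quoted 7.6e-6 is a maximum absolute difference between reference outputs and re-run outputs; the source specifies neither, and the function names are hypothetical.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def max_abs_diff(reference, actual) -> float:
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(reference) != len(actual):
        raise ValueError("sequences differ in length")
    return max(abs(r - a) for r, a in zip(reference, actual))
```

A validation run would then compare `sha256_of(...)` against the published digest and require `max_abs_diff(...)` to stay under a chosen tolerance.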

Leaderboard

↑ higher is better · ↓ lower is better · ECE = expected calibration error · AURC = area under the risk–coverage curve (selective prediction; a value near 0 means the model's errors sit among its least-confident predictions, so abstaining on those leaves the rest right).
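For readers unfamiliar with the two metrics, a minimal sketch of their standard definitions follows. It assumes equal-width confidence bins for ECE (ten by default) and 0/1 error as the risk for AURC; the benchmark's exact binning and tie-handling are not stated in the source, so treat these choices as assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between average confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(avg_conf - accuracy)
    return ece


def aurc(confidences, correct):
    """Mean selective risk over all coverage levels.

    Items are ranked from most to least confident; at coverage k/n the
    risk is the error rate among the k most confident predictions."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    errors = 0
    total_risk = 0.0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        total_risk += errors / k
    return total_risk / len(order)
```

Under this definition, a model that reports 0.85 on every item while getting 6 of 100 right has ECE = |0.85 − 0.06| = 0.79, which is in line with the ≈ 0.8 reported for the commodity LLM in the findings.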

Case gallery — inspect the predictions

Real test cases (3-char gap shown as ▢▢▢). Each shows the gold span and what each model proposed, with its self-reported confidence. Watch the pattern: the specialist is usually right; the commodity LLM is often confidently wrong (and sometimes ignores the gap length); the 3.3M pilot is wrong but honestly unconfident.

What the benchmark shows

1 · The realistic task separates the models. On a 3-char span the production specialists hold ~0.75–0.82 top-1, but the in-browser pilot collapses to ~0.02 (it cannot restore a contiguous span) and the commodity LLM manages ~0.06–0.08.
2 · The commodity LLM is confidently wrong. DeepSeek's verbalized confidence is the worst-calibrated by far (ECE ≈ 0.8): it self-reports ~0.85 confidence while being ~6% correct, and it gives no useful abstention signal. This is the empirical case against trusting a general LLM's self-reported confidence on epigraphy.
3 · Only the specialists are both capable and calibrated (ECE ≈ 0.10, AURC ≈ 0.05). The pilot's near-zero ECE reflects honesty, not skill: it correctly reports low confidence at near-zero accuracy.
Honest limits. The 3-char protocol is harder than the specialists' published multi-/unknown-length benchmark, so these top-1 numbers are not comparable to their papers' headlines; the comparison is valid because all models face the identical task on the identical records. Residual cross-database text overlap on the joint torso's side is deduplicated at the id+text level, so near-duplicate editions may remain. Region (place) is omitted from the head-to-head because the models use different region label spaces. n = 284 (Greek) / 291 (Latin) restoration items.