The first leakage-free comparison of restoration and dating models for ancient Greek and Latin that reports not just accuracy but also calibration (ECE) and selective prediction (AURC), the metrics that Ithaca, Aeneas, and Cullhed 2026 never publish. It asks the question a scholar actually needs answered: when the model is confident, is it right, and does it know when it doesn't know? Every number below is computed on records that no model in the comparison trained on, and each was independently re-derived.
Test items are I.PHI / LED records with id % 10 == 3, DeepMind's own held-out test split (from predictingthepast/train/dataloader.py), so Ithaca and Aeneas provably never trained on them. The set is further filtered against the joint torso's reconstructed training set (and text-deduplicated across databases), so it is unseen by all three models.
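The selection described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the function names `normalize` and `held_out`, the `(id, text)` record shape, and the whitespace/lowercase dedup key are all assumptions made for the example. Only the `id % 10 == 3` rule comes from the text.

```python
def normalize(text):
    """Hypothetical dedup key: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def held_out(records, train_ids, train_texts):
    """Keep records that no model in the comparison could have trained on.

    records     -- iterable of (numeric_id, text) pairs (assumed shape)
    train_ids   -- set of ids in the reconstructed training set
    train_texts -- texts seen in training, used for cross-database dedup
    """
    seen = {normalize(t) for t in train_texts}
    kept = []
    for rec_id, text in records:
        if rec_id % 10 != 3:      # the held-out test split rule
            continue
        if rec_id in train_ids:   # drop anything in the reconstructed training set
            continue
        key = normalize(text)
        if key in seen:           # drop text duplicates, including within the test set
            continue
        seen.add(key)
        kept.append((rec_id, text))
    return kept
```

Filtering on both the id and the text matters because the same inscription can appear under different ids in different databases.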
T1: restore a masked 3-character span in otherwise intact text (closer to real textual loss than single-character masking). T2: estimate the date (160 decade bins). Both models receive identical records; the commodity LLM receives the same text with a verbalized-confidence prompt.
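To make the two tasks concrete, here is a small sketch of how a T1 item and a T2 label could be built. The helper names `mask_span` and `decade_bin` are hypothetical, and the starting year of -800 is an assumption chosen only so that 160 ten-year bins cover a plausible range; the text states the bin count but not the range. The sketch also ignores the absence of a year 0 in historical dating.

```python
MASK = "▢"


def mask_span(text, start, width=3):
    """Return (masked_text, gold_span) for a contiguous gap of `width` characters."""
    if start < 0 or start + width > len(text):
        raise ValueError("span falls outside the text")
    gold = text[start:start + width]
    masked = text[:start] + MASK * width + text[start + width:]
    return masked, gold


def decade_bin(year, first_year=-800, n_bins=160):
    """Map a year to a decade index in [0, n_bins).

    first_year=-800 is an assumed lower bound, not taken from the source.
    """
    idx = (year - first_year) // 10
    if not 0 <= idx < n_bins:
        raise ValueError("year outside the binned range")
    return idx
```

For example, `mask_span("senatus populusque", 8)` hides the three characters starting at position 8 and returns them as the gold span, which is the form shown in the test cases as ▢▢▢.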
The forward pass is fixture-validated (7.6e-6); the split is proven from the training code plus a SHA-identity check on the shipped weights; every headline metric is independently re-derived. See the validation page and the spec. The full reproducibility package (model card · dataset SHAs · splits · how-to-run) is in eval/REPRODUCIBILITY.md.
↑ higher better · ↓ lower better · ECE = expected calibration error · AURC = area under the risk–coverage curve (selective prediction; near 0 means that, at nearly every coverage level, the predictions the model keeps after abstaining on its least-confident ones are right).
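For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them from per-item confidences and correctness flags. It assumes equal-width confidence bins for ECE and a simple average of risk over all coverage levels for AURC; the benchmark's own binning and tie-handling choices are not stated here and may differ.

```python
def ece(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width bins on [0, 1]."""
    n = len(confidences)
    if n == 0:
        raise ValueError("no predictions")
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0] * n_bins
    for c, ok in zip(confidences, correct):
        b = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        counts[b] += 1
        conf_sums[b] += c
        hit_sums[b] += 1 if ok else 0
    total = 0.0
    for count, conf_sum, hits in zip(counts, conf_sums, hit_sums):
        if count:
            gap = abs(hits / count - conf_sum / count)  # |accuracy - mean confidence|
            total += (count / n) * gap
    return total


def aurc(confidences, correct):
    """Area under the risk-coverage curve: mean error rate over all top-k subsets."""
    n = len(confidences)
    if n == 0:
        raise ValueError("no predictions")
    order = sorted(range(n), key=lambda i: -confidences[i])  # most confident first
    errors = 0
    total = 0.0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        total += errors / k  # risk when only the k most confident items are kept
    return total / n
```

A model whose errors sit among its least-confident predictions gets a lower AURC than one with the same accuracy whose errors sit among its most-confident predictions, which is the "confidently wrong" pattern these metrics are meant to expose.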
Real test cases (3-char gap shown as ▢▢▢). Each shows the gold span and what each model proposed, with its self-reported confidence. Watch the pattern: the specialist is usually right; the commodity LLM is often confidently wrong (and sometimes ignores the gap length); the 3.3M pilot is wrong but honestly unconfident.