The Latin Restoration Benchmark

Does the machine know when it is guessing?

Confidently wrong vs. honestly unsure

A reliability diagram plots, for each confidence level, how often the model was actually right. On the dotted diagonal, confidence equals accuracy: the model's certainty is honest. A specialist restorer hugs the line; a general LLM drifts far from it, pairing near-total confidence with near-zero accuracy.
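To make the diagram concrete, here is a minimal sketch of how its points could be computed. It assumes ten equal-width confidence bins and a per-prediction confidence in [0, 1]; the function name `reliability_bins` and the sample predictions are made up for illustration and are not taken from the benchmark.

```python
from typing import List, Tuple


def reliability_bins(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Return (mean confidence, accuracy, count) for each non-empty bin."""
    sums = [[0.0, 0, 0] for _ in range(n_bins)]  # confidence sum, hits, count
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 belongs in the last bin, not in a bin past the end.
        idx = min(int(conf * n_bins), n_bins - 1)
        sums[idx][0] += conf
        sums[idx][1] += int(ok)
        sums[idx][2] += 1
    return [(s / n, hits / n, n) for s, hits, n in sums if n > 0]


# Made-up predictions from an overconfident model: very sure, rarely right.
confs = [0.95, 0.97, 0.99, 0.96, 0.55]
hits = [False, False, True, False, True]

for mean_conf, acc, n in reliability_bins(confs, hits):
    gap = mean_conf - acc  # a positive gap means overconfidence
    print(f"confidence {mean_conf:.2f}  accuracy {acc:.2f}  n={n}  gap {gap:+.2f}")
```

Each returned tuple is one point on the diagram. A point on the diagonal has a gap of zero; the larger the positive gap, the more the model overstates its certainty.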

The leaderboard

The task is Latin restoration: recover the masked word an editor once supplied. exact = the model recovered that word exactly; CER = character error rate; ECE = expected calibration error (how far stated confidence is from real accuracy; lower is more honest). There are two test splits: leakage-free (held out from every model) and the looser Aeneas test split.

Model | Split | n | exact | CER | ECE
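The three score columns can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's own scoring code: CER is taken here as total Levenshtein edits divided by total reference characters, ECE uses ten equal-width bins, and the function names and the toy words are made up for the example.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]


def exact_match(preds: Sequence[str], refs: Sequence[str]) -> float:
    """Fraction of predictions identical to the reference word."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


def cer(preds: Sequence[str], refs: Sequence[str]) -> float:
    """Assumed definition: total edits over total reference characters."""
    edits = sum(edit_distance(p, r) for p, r in zip(preds, refs))
    return edits / sum(len(r) for r in refs)


def ece(confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10) -> float:
    """Weighted average of |mean confidence - accuracy| over non-empty bins."""
    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    total = len(confidences)
    score = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        score += len(members) / total * abs(mean_conf - accuracy)
    return score


# Toy data, invented for illustration only.
refs = ["fecit", "annis", "vixit"]
preds = ["fecit", "annos", "vivit"]
confs = [0.9, 0.9, 0.9]
hits = [p == r for p, r in zip(preds, refs)]

print(f"exact {exact_match(preds, refs):.3f}")
print(f"CER   {cer(preds, refs):.3f}")
print(f"ECE   {ece(confs, hits):.3f}")
```

In this toy case the two wrong answers are each one character off, so CER stays low while exact drops to one in three. The ECE is large because the model claimed 0.9 confidence on every word yet was right only a third of the time.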

The protocol