Does the machine know when it is guessing? · 机器是否知道自己在猜
A reliability diagram plots, for each confidence level, how often the model was actually right. On the dotted diagonal, confidence equals accuracy — the model's certainty is honest. A specialist restorer hugs the line; a general LLM floats high and to the right: near-total confidence, near-zero accuracy.
Latin restoration: recover the masked word an editor once supplied. exact = recovered it; CER = character error rate; ECE = expected calibration error (how far stated confidence is from real accuracy — lower is more honest). Two test splits: leakage-free (held out from every model) and the looser Aeneas test split.
| Model | Split | n | exact | CER | ECE |
|---|