Symbolon · Validation — what we can and can't claim, with numbers

Symbolon consolidates two open-source epigraphy specialists, Ithaca (Greek) and Aeneas (Latin), behind one research service, plus a small in-browser joint restorer. Three claims need evidence before anyone trusts the output. This page gives the evidence for each and flags what is still pending.

1. Cross-corpus is an identity / concept bridge, not one shared vector geometry

empirical correction

The intuitive design (concatenate the Greek and Latin embeddings into one index and let a single cosine search cross the language boundary) was tested and refuted. Ithaca and Aeneas are separately trained models, and their 384-dim clouds are nearly orthogonal:

mean-vector similarity cos(grc̄, lat̄) = −0.007 (essentially orthogonal);
within-corpus top-5 neighbours sit at ~0.80–0.97, but cross-corpus neighbours only at ~0.13–0.19, so a single index would rank same-language texts far above any true cross-language parallel;
and the supervised anchor set that a vector alignment (Procrustes / VecMap) would need is empty: in the bridge index, Greek rows are keyed only by PHI ids and Latin rows only by EDCS/TM ids, so 0 identity anchors are shared across grc↔lat. A learned rotation has nothing to fit.
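The three measurements above can be illustrated with a small probe. This is a minimal sketch under stated assumptions, not the contents of probe_geometry.py: the function names (`cosine`, `centroid`, `mean_top_k`) are hypothetical, embeddings are assumed to be plain lists of floats, and a real probe over 384-dim vectors would likely be vectorised.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centroid(rows):
    """Mean vector of a corpus (one row per inscription)."""
    return [sum(column) / len(rows) for column in zip(*rows)]

def mean_top_k(queries, index, k=5, skip_self=False):
    """Average cosine between each query and its k nearest rows in `index`.

    Set skip_self=True when `queries` and `index` are the same corpus,
    so that a row is not counted as its own neighbour.
    """
    top_scores = []
    for qi, query in enumerate(queries):
        sims = [
            cosine(query, row)
            for ri, row in enumerate(index)
            if not (skip_self and ri == qi)
        ]
        top_scores.extend(sorted(sims, reverse=True)[:k])
    return sum(top_scores) / len(top_scores)
```

Comparing `mean_top_k(grc, grc, skip_self=True)` with `mean_top_k(grc, lat)`, alongside `cosine(centroid(grc), centroid(lat))`, gives the within-corpus, cross-corpus and mean-vector figures for whatever embeddings are passed in.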

So the architecture is the honest one, not a workaround: retrieve per corpus in each model's own validated space, and bridge across languages at the concept / identity layer, through shared ids (ids_alt) and the editions, the only nodes that can carry both a PHI and an EDCS id. Any page that says Symbolon searches "one shared space across languages" is wrong; this is the corrected claim.
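The identity-layer bridge described above can be sketched as a join through edition nodes. This is an illustration only: the record layout (`key`, `lang`, `ids`), the id strings, and the function name `bridge_links` are assumptions made for the example, not the schema of the shipped bridge.

```python
from collections import defaultdict

def bridge_links(rows, editions):
    """Link Greek and Latin rows that meet at the same edition node.

    rows:     dicts with "key", "lang" ("grc" or "lat") and "ids" (a set of id strings)
    editions: dicts with "key" and "ids" (a set that may hold both a PHI and an EDCS id)
    Returns (greek_key, latin_key, edition_key) triples. No cosine score is involved.
    """
    rows_by_id = defaultdict(list)
    for row in rows:
        for ident in row["ids"]:
            rows_by_id[ident].append(row)

    links = []
    for edition in editions:
        matched = [row for ident in edition["ids"] for row in rows_by_id.get(ident, [])]
        greek = sorted({row["key"] for row in matched if row["lang"] == "grc"})
        latin = sorted({row["key"] for row in matched if row["lang"] == "lat"})
        links.extend((g, l, edition["key"]) for g in greek for l in latin)
    return links
```

An edition that carries only one kind of id produces no link, which matches the claim that a cross-language parallel is asserted only where an identity is actually shared.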

source: probe_geometry.py (2026-05) · bridge data census + scripts/upgrade_bridge_v2.py (2026-06-13) · Symbolon §0.5

2. The orchestrating model must abstain, not invent: measured on real lacunae

Invariant §0.1 forbids the orchestrating LLM from guessing a restoration from its own weights; a fabricated reading is the worst failure mode. We tested three candidate models on a balanced sample from our own ground truth (I.Sicily EpiDoc TEI): 100 translation items + 100 abstention items (50 answerable / 50 genuinely unanswerable), run sequentially with a hardened runner.

model	EN chrF	restore-exact	abstain-correct	hallucinations	data loss
qwen3.7-plus	0.619	0.46	1.00	0 / 50	4 / 200
deepseek-v4-flash	0.576	0.20	0.78	11 / 50	0 / 200
kimi-k2.6 ⚠ invalid	0.167*	0.52*	0.16*	42*	38 / 200*
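The abstention columns in the table can be computed along the following lines. This is a hedged sketch, not the scoring code behind eval/RESULTS.md: the item fields (`answerable`, `gold`, `output`), the rule that an abstention is any output beginning with "uncertain", and the choice to count empty outputs only as data loss are all assumptions made for illustration.

```python
def score_run(items):
    """Score one model run on restoration and abstention.

    items: dicts with "answerable" (bool), "gold" (str or None) and "output" (str).
    """
    def abstained(output):
        return output.strip().lower().startswith("uncertain")

    answerable = [item for item in items if item["answerable"]]
    unanswerable = [item for item in items if not item["answerable"]]

    exact = sum(item["output"].strip() == item["gold"] for item in answerable)
    abstain_ok = sum(abstained(item["output"]) for item in unanswerable)
    # A hallucination is a non-empty, non-abstaining answer to an unanswerable item.
    hallucinated = sum(
        bool(item["output"].strip()) and not abstained(item["output"])
        for item in unanswerable
    )
    empty = sum(not item["output"].strip() for item in items)

    return {
        "restore_exact": exact / len(answerable),
        "abstain_correct": abstain_ok / len(unanswerable),
        "hallucinations": hallucinated,
        "data_loss": empty,
    }
```

Keeping empty outputs in a separate counter is one way to stop runner artifacts, such as a timeout, from being read as either honesty or fabrication.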

Qwen3.7-plus is the clear winner on the decisive axis. It is the only valid run that both restores well when it can (0.46 exact) and abstains perfectly when it can't (100% correct, 0 fabricated readings), while also topping English translation. Its abstentions are genuine: clean "uncertain — the lacuna admits several possible readings." For the public shared-key Reading Room, where a fabricated reading is the worst failure, this is exactly the wanted behaviour, verified on real lacunae.

DeepSeek V4 Flash is the safe, cheap runner-up: solid English (0.576), 78% honest abstention, notably cautious (its low restore-exact reflects a high abstention tendency, not poor knowledge), zero data loss, ~6–12× cheaper. It is a defensible production default where cost and latency win and 78% honesty is acceptable.

⚠ Kimi K2.6 results are INVALID for this run; do not judge it on these numbers. Two artifacts sank its scores: 38/200 empty outputs (a 90 s hard cap, added to stop a real hang, fired on its slow reasoning) and raw chain-of-thought leaking into answers. It needs a re-run with a longer timeout and reasoning/content separation; its actual differentiator, vision (reading a stone photo), wasn't tested here. * Asterisked numbers are artifact-driven, not a verdict.

Chinese chrF is excluded, not hidden: these terse, proper-name-dominated funerary texts make chrF punish legitimate name-transliteration variance brutally (菲洛美娜 vs 斐路梅纳斯, both correct → 0.0). No Chinese-quality claim is made from this eval until a fuller reference set (the gate-verified Happy Latin pages) replaces the name-heavy stones.
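The 0.0 in the transliteration example follows directly from how a character n-gram F-score works: the two renderings share no characters, so every n-gram order has zero overlap. The sketch below is a simplified chrF-style score written for illustration (it averages a per-order F-score, and `simple_chrf` is a hypothetical name); it is not the implementation used in the eval and will not match a reference chrF tool digit for digit.

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams, ignoring spaces."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Average character n-gram F-beta over the orders both strings can supply."""
    scores = []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one string is shorter than n characters
        overlap = sum((hyp & ref).values())
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
        else:
            scores.append(
                (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
            )
    return sum(scores) / len(scores) if scores else 0.0

# Two acceptable transliterations with no characters in common.
print(simple_chrf("菲洛美娜", "斐路梅纳斯"))
```

On a short text dominated by one proper name, a penalty of this kind can swamp the rest of the translation, which is the reason given above for withholding the Chinese axis.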

source: eval/RESULTS.md — Run 2 (2026-06-12), hardened sequential runner, scored on the two trustworthy axes (EN chrF + abstention/restoration).

3. The in-browser joint restorer: calibrated, on provably held-out data

The live joint restorer is a small (3.3 M-param, 4-layer / 256-dim) character transformer with a corpus embedding, CPU-trained across three corpora and shipped as plain <script> files. It is a pilot, not a rival to DeepMind's specialists, but its numbers are honest and its confidence is calibrated. Measured on a provably held-out validation set (disjoint from training; shipped weights SHA-identical to the run; leakage-free):

task	n	metric	calibration
T1 restoration (grc)	478	top-1 exact 0.554 · CER 0.446	ECE 0.060 · AURC 0.243
T2 date (±50 yr)	276	acc 0.370 · MAE 122.7 yr	p=0.8 → cov 0.76
T3 region (of 225)	466	top-1 0.352 · top-5 0.614	exact-name
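For readers unfamiliar with the T2 and T3 metrics in the table, the following sketch shows one plausible way to compute them. The function names are hypothetical, and treating the ±50 yr window as inclusive is an assumption; the harness behind eval/RESULTS-heldout.md may define the boundary differently.

```python
def date_metrics(predicted_years, true_years, tolerance=50):
    """Return (share of predictions within +/- tolerance years, mean absolute error)."""
    errors = [abs(p - t) for p, t in zip(predicted_years, true_years)]
    within = sum(error <= tolerance for error in errors) / len(errors)
    mae = sum(errors) / len(errors)
    return within, mae

def top_k_accuracy(ranked_labels, true_labels, k):
    """Share of items whose true label appears among the first k ranked labels.

    Labels are compared by exact string match, as in the exact-name scoring of T3.
    """
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_labels, true_labels))
    return hits / len(true_labels)
```

Because `top_k_accuracy` compares exact names, two labels for neighbouring or nested regions count as a miss, which is one reason a fine-grained label set depresses the headline number.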

What "calibrated" buys you: an Expected Calibration Error of 0.060 means the model's stated confidence tracks its real accuracy; when it says 0.85, it is right ~85% of the time. So the ranked hypotheses and the confidence on each are trustworthy as a selective tool: restrict to the model's most-confident half and restoration error drops from 0.446 to 0.289 (selective risk @50% coverage). That is the honest use: a confidence-aware shortlist for a human, never a single verdict (invariant §0.2).
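The two quantities in this paragraph, Expected Calibration Error and selective risk at a given coverage, can be sketched as follows. This is a minimal illustration using ten equal-width confidence bins and hypothetical function names; the binning and tie-breaking used for the reported 0.060 and 0.289 are not stated on this page and may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between stated confidence and accuracy, per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for confidence, ok in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((confidence, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(mean_confidence - accuracy)
    return ece

def selective_risk(confidences, correct, coverage=0.5):
    """Error rate on the most-confident `coverage` fraction of predictions."""
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    kept = ranked[:max(1, round(len(ranked) * coverage))]
    return 1 - sum(ok for _, ok in kept) / len(kept)
```

A low ECE is what makes the confidence ranking in `selective_risk` meaningful: if stated confidence did not track accuracy, keeping the most-confident half would not reliably lower the error.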

Reliability (confidence bin → real accuracy)

stated confidence	n	actual accuracy
0.2–0.3	80	0.30
0.3–0.4	86	0.535
0.5–0.6	46	0.565
0.7–0.8	28	0.679
0.8–0.9	46	0.848
0.9–1.0	55	0.964

Stated confidence and actual accuracy rise together, which is the mark of a calibrated model. (T3 region is alignment-limited: the model predicts 225 fine regions scored by exact-name match; a coarse-region map would raise the headline number.)

source: eval/RESULTS-heldout.md (EpiCal S2, joint torso grc, seed-0 held-out val, leakage-free). A provisional leakage-uncontrolled run (eval/RESULTS-calib.md) corroborates: top-1 0.53, ECE 0.067.

4. Four behaviours every answer must obey

The honesty axes above translate into four named behaviours the agent is held to. Each is a test case, not a slogan.

abstain: Unanswerable → "uncertain", no guess

A lacuna with several admissible readings returns a clean abstention, not a fabricated letter. Qwen: 50/50 correct, 0 invented-letter outputs in this sample. For example, a 1-letter gap that could be ε or η → "uncertain — admits several readings".

formula-supported: Restore only when the specialist / corpus supports it

A restoration is offered only as the specialist's ranked top-k with saliency, never as the orchestrator's own guess (§0.1). The joint restorer's 0.554 top-1 is calibrated, so the shortlist is trustworthy as a shortlist.

model-only (flagged): General-LLM knowledge is labelled, never laundered as fact

If only the orchestrating model's general knowledge (not a specialist call or the index) supports a statement, it is labelled as such and handed back for confirmation; it does not enter the citation set. Every cited id must be present in the retrieved set; others are stripped (§0.1, §0.3).
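The citation rule can be illustrated with a small filter. This is a sketch under assumptions: citations are taken to appear as bracketed ids in the answer text, and the id strings and the function name `filter_citations` are hypothetical. The actual citation format used by the service is not described on this page.

```python
import re

# Assumed citation syntax for this example: an id wrapped in square brackets.
CITATION = re.compile(r"\s*\[([^\[\]]+)\]")

def filter_citations(answer, retrieved_ids):
    """Drop every bracketed citation whose id is not in the retrieved set.

    Returns (cleaned_answer, kept_ids, stripped_ids).
    """
    kept, stripped = [], []

    def keep_or_drop(match):
        ident = match.group(1)
        if ident in retrieved_ids:
            kept.append(ident)
            return match.group(0)
        stripped.append(ident)
        return ""

    cleaned = CITATION.sub(keep_or_drop, answer)
    return cleaned, kept, stripped
```

Returning the stripped ids alongside the cleaned text lets a caller log or flag them rather than discard them silently.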

cross-lang bridge: Greek↔Latin links go through identity / concept, not cosine

A cross-language parallel is asserted only via a shared identity (an edition carrying both a PHI and an EDCS id, or an ids_alt match) or a shared concept, never via a single cross-corpus cosine score, which §1 showed is meaningless across the boundary.

5. The non-negotiable invariants

The specialists are the only source of epigraphic facts. Restorations, datings, places, parallels come only from Ithaca/Aeneas or the index, never from the orchestrating LLM's weights.
Preserve uncertainty through every layer. Restoration is a top-k ranked list, dating is a distribution; neither is ever collapsed to one confident claim.
Human-in-the-loop is the design. Every synthesis ends by naming the specialist calls that would confirm each move, and hands authority back to the scholar. Output is a starting point, never a conclusion.
Honesty about seams. Where a real model call is not yet wired, the code raises an explicit error; it does not return plausible fake data.
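Two of the invariants above, preserving uncertainty and honesty about seams, lend themselves to a short illustration. The sketch below is not Symbolon's code: the class and function names (`Hypothesis`, `Restoration`, `SpecialistNotWired`, `restore`) are hypothetical, and the specialist is modelled as any callable that returns scored hypotheses.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

@dataclass(frozen=True)
class Hypothesis:
    text: str
    confidence: float

@dataclass(frozen=True)
class Restoration:
    # Ranked best-first; the type holds a list, so there is no single "answer" field.
    hypotheses: Tuple[Hypothesis, ...]

class SpecialistNotWired(RuntimeError):
    """Raised when no real specialist model is connected."""

def restore(
    text: str,
    specialist: Optional[Callable[[str], Sequence[Hypothesis]]] = None,
    k: int = 3,
) -> Restoration:
    """Return the specialist's top-k hypotheses, or fail loudly if none is wired."""
    if specialist is None:
        # Seam honesty: refuse rather than return plausible placeholder data.
        raise SpecialistNotWired("no specialist model is connected")
    ranked = sorted(specialist(text), key=lambda h: h.confidence, reverse=True)
    return Restoration(hypotheses=tuple(ranked[:k]))
```

Making the return type a ranked tuple rather than a single string means downstream layers have to handle several hypotheses explicitly, which is one way to keep uncertainty from being collapsed by accident.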

6. Case gallery: worked examples from the live bridge

Specialists asked for examples, not only aggregate numbers. Every case below is generated by querying the shipped bridge (build_validation_gallery.py); nothing is hand-authored, so the gallery is auditable and regenerates with the data.


Sources & honesty. Every number on this page is from magalia's own evaluation harness, run on its own held-out ground truth, not a vendor benchmark. LLM bake-off: eval/RESULTS.md (Run 2, 2026-06-12). Joint-restorer calibration: eval/RESULTS-heldout.md (provably leakage-free). Full reproducibility package (model card · dataset SHAs · splits · how-to-run): eval/REPRODUCIBILITY.md; the leakage-free comparison vs Ithaca/Aeneas/commodity LLM is the EpiCal benchmark. Bridge geometry: probe_geometry.py + bridge census (2026-06-13). Pending: a Kimi re-run with a longer timeout + vision test; a Chinese-quality axis on fuller references; a 225→coarse region map. Those are named, not hidden.