One inscription, five databases.
A small marble plaque, 15.5 × 24.5 × 3 cm, was first recorded around 1730 in the Museo Salnitriano in Palermo and now sits in the Museo Archeologico Regionale Antonino Salinas. It carries seven lines of Greek on the left and seven lines of archaic Latin on the right, separated by a deeply incised vertical rule. The text is — fittingly — an advertisement for a stonecutter's shop, in two languages, telling passers-by that "inscriptions are designed and carved here for sacred temples in connection with public works." The inscription literally advertises the production of inscriptions.
It is the perfect case study for what Heřmánková, Kaše & Sobotková (JDH 2021) are up against. This single object appears, with subtly different metadata, in every major Latin and Greek epigraphic database — except one. Walking through how each database represents it shows in concrete detail why building a clean, comparable, cross-database dataset is so much harder than the abstract ETL-pipeline diagrams suggest.
§ 1 The text
The inscription as carved on the stone. Hover over a Greek line to highlight the Latin equivalent. The two columns are not polished parallel translations but a loose biversion, with awkward word-for-word renderings on both sides.
The Latin column uses archaic spellings: heic (later hic), aidibus sacreis (later aedibus sacris), qum (later cum). The archaisms led 19th-century editors to date the plaque to the late Republic; modern paleographic and spelling analysis (Wilson 1990) puts it in the Augustan to Julio-Claudian period, while Manni Piraino preferred the late 2nd century CE. The dating disagreement has practical consequences: see Issue 3.
§ 1.5 From paper to digital: 140 years of editions
Before any of the modern databases existed, this inscription had already been published twice in the great 19th-century print corpora. Each editor made different choices. Each digitization that followed inherited some of those choices and silently dropped others.
The chain stretches at least eight steps long: physical stone (1st c. CE) → 18th-century manuscript transcription (Torremuzza, Ignarra) → CIL X 7296 (Mommsen, 1883) → IG XIV 297 (Kaibel, 1890) → 20th-century revisions (Wilson, Manni Piraino, Bivona) → five modern digital databases → SDAM's cleaning pipelines → JDH 2021 paper. Look at the two print editions side by side:
Original print pages (scans)
These are the actual pages, the source that every database in this case study ultimately quotes. Compare them with the cleaned facsimiles below: the editorial discussion printed above the inscription is exactly the prose that the modern databases drop.
Three things the print editions kept
- Argumentation. Both editors record their reasoning: Mommsen argues "Siculam originem prodit quod bilinguis est, cum litterae sint optimae aetatis" — the Sicilian origin is shown by the bilingualism, the dating by the high quality of the letterforms. None of the modern databases preserves this reasoning chain. They give you a place and a date; the print editions give you why.
- Provenance history. Mommsen writes "Panormi olim apud Iesuitas, nunc in museo publico" — formerly with the Jesuits in Palermo, now in the public museum. EDR has only "Palermo?" with a question mark. The 1762 → 1883 → modern chain is in the print but lost from the digital.
- Editorial judgment. Mommsen calls the stonecutter's bilingualism infantia ("infancy / incompetence"). Kaibel demurs: "nec Graecus opinor nec Romanus homo" ("a man neither Greek nor Roman, in my opinion"). Two scholarly judgments coexist on the same artifact. The modern databases collapse both into a single normalized "type: epitaph" or "type: advertisement."
Three things the print editions changed (and the digital inherited)
- The line "QVM OPERVM" became "CVM OPERVM." Both Mommsen (1883) and Kaibel (1890) silently regularize the archaic QVM on the stone to CVM. Look at line 6 of both columns above; the orange marker on the CIL flags it. The actual stone says QVM (visible in the photograph in the previous section). EDCS, EDR, and PHI 140601 all inherit "CVM" from the print editions. Only PHI 175744 and I.Sicily preserve "QVM", and only I.Sicily encodes both forms with EpiDoc <choice> markup. The textual normalization that began on Mommsen's printing block in 1883 was still propagating in 2024.
- Word-spacing got invented. The actual stone has no word dividers. Mommsen (CIL) introduces them in some lines (e.g. ΝΑΟΙϹ ΙΕΡΟΙϹ). Kaibel (IG) runs everything together (ΝΑΟΙϹΙΕΡΟΙϹ). The modern databases mostly follow Mommsen, but the 19th-century split is still visible in the transcription differences between PHI 175744 and PHI 140601.
- The Greek got a normalized version. Below the columns, Kaibel supplies a single normalized Greek sentence with proper accents, breathings, and spacing: Στῆλαι ἐνθάδε τυποῦνται καὶ χαράσσονται ναοῖς ἱεροῖς σὺν ἐνεργείαις δημοσίαις. Mommsen does not. This is the line from which the modern databases (PHI especially) derive their Greek text. The "canonical" Greek text in PHI today is Kaibel's 1890 reading, four iterations away from the stone.
What the digital era discarded entirely
The Latin commentary paragraphs you can see in both print snapshots (Mommsen's argument about marble quality, Kaibel's quotation of Mommsen's verdict, both editors' interpretations of the bilingual stonecutter) are gone. None of the five modern databases has a structured field for "editor's interpretive commentary." EDR records "Textus secundum (6)", meaning "text follows reference 6 [Manni Piraino 1973]." That's it. The argumentative scholarly culture of 19th-century epigraphy was replaced by a database row. I.Sicily's TEI <commentary> field is the only modern tool here that admits prose commentary back into the structured record, which is why I.Sicily looks so much richer than the others in this case study. (For the underlying TEI block structure that supports this, including msDesc, physDesc, history, apparatus, named entities, and certainty markers, see the Atlas · EpiDoc deep-dive series.)
§ 2 Five database views
Five major epigraphic databases (and one notable absence) record this same inscription. Click any to open its actual record.
- PHI 175744: transcribes qum operum. Bibliography names three competing dates.
- PHI 140601: transcribes cum operum (not qum) and is listed as "undated". The same database lists this inscription twice with different transcriptions.
§ 3 Side-by-side comparison
Same physical object. Five databases. Pick any field to see how they disagree.
| Field | I.Sicily | EDR | EDCS | PHI 175744 | PHI 140601 |
|---|---|---|---|---|---|
| Database ID | ISic000470 | EDR140617 | 22000882 | PH175744 | PH140601 |
| Indexed under | Native ID | Native ID + TM (Trismegistos) | Native ID | IGLPalermo 139 | IG XIV 297 |
| Region | Italy / Sicily / Palermo | Sic? / Panhormus? / Palermo? | unknown | Sikelia — Prov. unkn. [Palermo] | Sikelia — Prov. unkn. [Palermo] |
| Inventory no. | Salinas, inv. 3574 | Salinas, inv. 8822 | not recorded | not recorded | not recorded |
| Width (cm) | 24.5 | 14.5 | — | — | — |
| Date range | 1–200 CE | 100 BCE – 100 CE | — | late 2nd c. AD; or late Repub.; or 1st c. AD | undated |
| Latin line 12 | qum operum (reg.: cum) | qum (:cum) operum | — | qum operum | cum operum |
| Material analysis | marble · 6 candidate quarries (pXRF) | marmor | — | — | — |
| Translation | English (Prag) | — | — | — | — |
| Image | tiled TIF 3680 × 5520 px + JPG | photo via gallery | scattered, when present | none | none |
| Bibliography (#) | 25+ | 14 | few | 3 | 1 |
| Commentary | long, includes Punic-speaker debate | "text follows Manni Piraino 1973" | — | — | — |
| License | CC-BY 4.0 | CC-BY-NC-SA 4.0 | unspecified | unspecified | unspecified |
| Persistent DOI | 10.5281/zenodo.4337543 | — | — | — | — |
§ 4 The seven issues
ID proliferation: one stone, eleven names.
No single canonical identifier exists for this inscription. It is referenced by at least eleven distinct schemes — five born-digital database IDs, three print-corpus references, two epigraphic-bulletin references, and one persistent DOI. Worse, PHI alone uses two different IDs because its database is organized by which printed edition the text comes from, so the same physical inscription gets one PHI number per edition that published it.
The same museum, two different inventory numbers.
I.Sicily records the inscription as Salinas Museum inventory 3574 (with a former Museo Salnitriano number 51). EDR records it as Salinas inventory 8822. They cannot both be right. Either the museum changed inventory numbers and one database didn't update, or one of them transcribed wrong. Externally there is no way to tell which.
Three different date ranges, four databases, four answers.
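The dating disagreement here would single-handedly distort the JDH paper's "epigraphic habit" curve at this inscription's location. I.Sicily gives 1–200 CE, EDR gives 100 BCE – 100 CE, PHI 175744 lists three alternatives from its bibliography (late 2nd century AD, late Republic, or 1st century AD), and PHI 140601 leaves the inscription undated.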
The "qum / cum" problem: two transcriptions of the same line.
Latin column, line 12. PHI 175744 transcribes qum operum — preserving the archaic spelling that helps date the text. PHI 140601 transcribes cum operum — silently normalizing it to standard Latin. Same database, same inscription, different transcriptions.
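The three encodings side by side:
- PHI 175744: qum operum
- PHI 140601: cum operum
- I.Sicily: <choice><orig>qum</orig><reg>cum</reg></choice> operum

A researcher who searches the text for archaic spellings (a real research question; see Kruschwitz 2000) finds this inscription through PHI 175744 but not through PHI 140601, and whether EDCS returns it depends on which edition's transcription it follows. Only I.Sicily's EpiDoc <choice> markup makes both the original and the regularized form recoverable from a single record.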
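To show what "recoverable from a single record" means in practice, here is a minimal sketch that flattens the <choice> fragment into either reading. It uses Python's standard `xml.etree.ElementTree`. The wrapping `<ab>` element and the function name `render` are assumptions for this example, and the TEI namespace that real EpiDoc files declare is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# The <choice> fragment for this line, wrapped in a hypothetical <ab> element.
FRAGMENT = "<ab><choice><orig>qum</orig><reg>cum</reg></choice> operum</ab>"


def render(element: ET.Element, keep: str) -> str:
    """Flatten an element to text, keeping only <orig> or only <reg>."""
    drop = "reg" if keep == "orig" else "orig"
    parts = [element.text or ""]
    for child in element:
        if child.tag != drop:
            parts.append(render(child, keep))
        # Text after the child's closing tag belongs to the parent's flow.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(FRAGMENT)
diplomatic = render(root, keep="orig")   # the spelling on the stone
regularized = render(root, keep="reg")   # the normalized spelling
```

A search index could store both strings, so a query for archaic spellings and a query for standard Latin would each find the record. A plain-text transcription supports only one of the two.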
Image presentation: from a tiled TIF to nothing at all.
The image situation across these databases is wildly asymmetric:
- I.Sicily: high-resolution tiled TIF (3680 × 5520 px) plus a print JPG, both encoded in the EpiDoc <facsimile> element with explicit attribution and license (CC-BY 4.0).
- EDR: typically has photos but on a separate gallery page; no anchored regions, no IIIF.
- EDCS: image presence is inconsistent across records; when present, it's a flat JPG with no metadata.
- PHI (both records): no images at all. PHI is a text-only corpus by design.
The text↔image anchoring problem.
Even where images do exist, the relationship between the transcribed text and the photograph is rarely explicit. EpiDoc supports <facsimile> with <zone> elements that can pin each line (or even each character) to pixel coordinates on the photo. Almost no databases use this capability fully.
I.Sicily's TEI declares letter-height measurements per line — line 1 = 22 mm, line 2 = 20 mm, line 3 = 8 mm, lines 4–7 = 10 mm — but does not tag pixel zones on the photograph. The reconstruction below is built directly from those measurements and from the actual photograph of the stone. Click any line in the right panel (or hover the SVG) to see exactly what zone-anchoring would look like, were any of the five databases publishing it.
Hover or click any line — its zone lights up on the SVG and its declared letter height appears on the right edge of the plaque.
- L1: ϹΤΗΛΑΙ · TITVLI, 22 mm
- L2: ΕΝΘΑΔΕ · HEIC, 20 mm
- L3: ΤΥΠΟΥΝΤΑΙ ΚΑΙ · ORDINANTVR ET, 8 mm
- L4: ΧΑΡΑϹϹΟΝΤΑΙ · SCVLPVNTVR, 10 mm
- L5: ΝΑΟΙϹ ΙΕΡΟΙϹ · AIDIBVS SACREIS, 10 mm
- L6: ϹΥΝ ΕΝΕΡΓΕΙΑΙϹ · QVM OPERVM, 10 mm
- L7: ΔΗΜΟϹΙΑΙϹ · PVBLICORVM, 10 mm

The actual photograph is hosted by I.Sicily under CC-BY 4.0; this SVG is a stylized reconstruction matching the published letter-height measurements.
Notice three things the SVG demonstration makes concrete:
- The dramatic letter-height jump between lines 2 and 3 (20 mm → 8 mm) is itself a paleographic feature. It signals that the cutter laid out the headline ("STELAI · TITULI / ENTHADE · HEIC") in display capitals, then continued the body in a markedly smaller script. A purely text-based dataset row has no way to encode this, but a zone-anchored facsimile preserves it for free.
- The dividing groove visible down the middle is itself a typographic decision — the cutter physically separates the two languages with a deeply incised vertical line, treating them as parallel columns rather than running text. None of the five databases encodes "deeply incised vertical column divider" as a structured field.
- The orange ferruginous staining visible in the photograph (and reproduced as small spots in this SVG) is metadata about conservation history, not the inscription itself. It belongs to material analysis. Only I.Sicily's TEI <objectDesc>/<condition> can express such things; the other databases have nowhere to put them.
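Without text-image anchoring, an editor's correction (for example, "the third letter of line 5 is actually a sigma, not an epsilon") cannot be verified at the data layer; the sharp change in letter height between lines 2 and 3 cannot be quantified; the dividing groove cannot be searched for; and the orange staining cannot be tracked across conservation campaigns. Visual evidence is reduced to a metadata description, detached from the evidence that produced it. That detachment is why traditional epigraphers find "inscriptions as data" too reductive, and it is the gap that EpiDoc <facsimile> with <zone> elements would close, were it adopted across the corpus rather than at one project alone.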
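Even without pixel zones, the per-line letter heights that I.Sicily declares allow a rough quantification of the layout. The sketch below computes the ratio of each line's letter height to the previous line's. The function name `height_ratios` is an assumption for this example, and it says nothing about where the letters sit on the photograph, which is the part that zone anchoring would add.

```python
# Letter heights in mm per line, as declared in I.Sicily's TEI.
LETTER_HEIGHT_MM = {1: 22, 2: 20, 3: 8, 4: 10, 5: 10, 6: 10, 7: 10}


def height_ratios(heights: dict) -> dict:
    """Ratio of each line's letter height to the preceding line's."""
    lines = sorted(heights)
    return {(a, b): heights[b] / heights[a] for a, b in zip(lines, lines[1:])}


ratios = height_ratios(LETTER_HEIGHT_MM)
# The pair of consecutive lines with the steepest reduction in letter height.
sharpest_drop = min(ratios, key=ratios.get)
```

Here `sharpest_drop` is the pair `(2, 3)`, where the letters shrink to 0.4 of the previous height. That matches the headline-to-body transition described above.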
The EDH gap.
The Heidelberg corpus — which the JDH paper repeatedly invokes as its highest-quality dataset — does not include this inscription. The I.Sicily TEI explicitly leaves the <idno type="EDH"/> field empty. Sicily and southern Italy have historically been outside EDH's scope (it focused on the Latin western provinces and the limes). The inscription is in EDCS, but EDCS-only analyses inherit all of that database's editorial roughness.
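An empty identifier field is itself machine-readable information. The sketch below separates filled from empty `<idno>` elements in a TEI fragment. Only the empty `<idno type="EDH"/>` is taken from the I.Sicily record as described above. The other `type` labels, the fragment layout, and the function name `split_idnos` are assumptions for this example, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the type labels other than "EDH" are assumed.
FRAGMENT = """
<publicationStmt>
  <idno type="EDR">EDR140617</idno>
  <idno type="EDCS">EDCS-22000882</idno>
  <idno type="EDH"/>
</publicationStmt>
"""


def split_idnos(xml_text: str):
    """Return (present, missing): filled idno values by type, and empty types."""
    present, missing = {}, []
    for idno in ET.fromstring(xml_text).iter("idno"):
        value = (idno.text or "").strip()
        if value:
            present[idno.get("type")] = value
        else:
            missing.append(idno.get("type"))
    return present, missing


present, missing = split_idnos(FRAGMENT)
```

Run over a whole corpus, a check like this would list the inscriptions that an EDH-based analysis cannot see.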
§ 5 A merge simulator
If you fed all five database records into the SDAM LI_ETL deduplication pipeline, here's what would happen.
- I.Sicily: inv 3574, width 24.5
- EDR: inv 8822, width 14.5
- EDCS: inv not recorded, width not recorded
- PHI 175744: line 12 qum
- PHI 140601: line 12 cum
If the records fail to match on these fields and stay separate, any analysis "counting Latin inscriptions" counts this object five times (I.Sicily + EDR + EDCS + PHI 175744 + PHI 140601 = 5 records for one stone, four of them surplus), plus a phantom "undated" record from PHI 140601 that drops out of every chronological aggregation. This is exactly what the JDH paper's deduplication step exists to prevent, and exactly why it's so hard to do automatically.
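The failure can be reproduced with a deliberately naive matcher. The sketch below groups the five records by exact agreement on chosen fields, using the values from the comparison table. It is an illustration of why field matching breaks down, not a description of how the SDAM LI_ETL pipeline works; the function name `cluster` and the record layout are assumptions for this example.

```python
from collections import defaultdict

# Field values as listed in the comparison table; None means "not recorded".
RECORDS = [
    {"id": "ISic000470", "inv": "3574", "width": 24.5, "line12": "qum"},
    {"id": "EDR140617", "inv": "8822", "width": 14.5, "line12": "qum"},
    {"id": "EDCS-22000882", "inv": None, "width": None, "line12": None},
    {"id": "PH175744", "inv": None, "width": None, "line12": "qum"},
    {"id": "PH140601", "inv": None, "width": None, "line12": "cum"},
]


def cluster(records, key_fields):
    """Group records that agree on every key field; a missing value never matches."""
    clusters = defaultdict(list)
    for rec in records:
        values = tuple(rec[field] for field in key_fields)
        # A record with any missing key field becomes its own singleton cluster.
        key = values if None not in values else ("unmatched", rec["id"])
        clusters[key].append(rec["id"])
    return list(clusters.values())


by_metadata = cluster(RECORDS, ["inv", "width"])
by_text = cluster(RECORDS, ["line12"])
```

Matching on inventory number and width leaves five clusters, so the stone is counted five times. Matching on the line 12 transcription yields three clusters and still fails to link the two PHI records. Only an explicit identifier crosswalk collapses all five into one.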
§ 6 Why this hinders the JDH paper's analyses
The Heřmánková–Kaše–Sobotková paper is one of the most rigorous attempts to do macro-history with this kind of data. This single inscription shows where every single one of its careful methodological moves still has to absorb cost.
- The "epigraphic habit" curve (Fig 1): one inscription with a 200-year date range contributes a flat plateau of 0.005 inscriptions/year × 200 years to the empire-wide aggregate. Across hundreds of thousands of inscriptions this averages out, but only if the date ranges are consistent. Here, four different ranges from four databases would each shift the local curve.
- Provincial distribution (Figs 4–6): EDR's "Sic?" with a question mark cannot be aggregated cleanly with EDH's confident "Sicilia" attribution. A clean province-rank-order plot must either drop "Sic?" (losing data) or treat it as confident "Sicilia" (overcounting).
- Type distribution (Fig 2): I.Sicily classifies this inscription as function.advertisement (EAGLE term 128). EDR doesn't have an explicit type for "advertisement", so it falls under cetera (other). PHI doesn't classify by function at all. The same physical object would be counted as "advertisement" in one analysis, "other" in another, "no type" in a third.
- The bilingual question: I.Sicily encodes this as biversion.duplicating, Latin + Greek, with a structured taxonomy (textLang mainLang="la" otherLangs="grc"). EDCS has latina-graeca as a flat string, EDR has latina-graeca as a flat string, and PHI catalogs it under "Greek" because it's in PHI Greek. Aggregating "how many Latin inscriptions are bilingual?" is therefore a labyrinth.
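The arithmetic behind the flat plateau in the first bullet can be sketched as follows. This is a simplified uniform-weight model under stated assumptions, not necessarily the paper's exact procedure: each inscription spreads a total weight of 1 evenly over the years of its date range. The function name `yearly_weight`, the half-open interval convention, and the treatment of BCE years as negative integers (ignoring the missing year 0) are choices made for this example.

```python
def yearly_weight(start: int, end: int, year: int) -> float:
    """Uniform weight that an inscription dated to [start, end) adds to one year."""
    return 1 / (end - start) if start <= year < end else 0.0


# Date ranges from the comparison table, as half-open intervals of 200 years.
RANGES = {
    "I.Sicily": (1, 201),   # 1-200 CE
    "EDR": (-100, 100),     # 100 BCE - 100 CE
}

# Weight each database's dating would add to the curve in three sample years.
contributions = {
    name: {year: yearly_weight(start, end, year) for year in (-50, 50, 150)}
    for name, (start, end) in RANGES.items()
}
```

Under I.Sicily's dating the inscription adds 0.005 to the year 150 CE and nothing to 50 BCE; under EDR's dating it is the other way round. Both agree only where the ranges overlap, as in 50 CE.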
§ 7 What I.Sicily models well
It is fair to single out one record as a benchmark. I.Sicily's TEI for ISic000470 demonstrates, in a single open-access XML file, what comprehensive data-aware epigraphy looks like:
- Identifier crosswalk: every external ID (EDR, EDCS, PHI ×2, TM, DOI, all the print refs) is recorded in publicationStmt.
- Linked authority files: Pleiades for the ancient place (Panhormus = pleiades 462410), GeoNames for the modern (Palermo = 2523920), Eagle Network vocabularies for inscription type and material, ORCID for editors.
- Original alongside cleaned: <choice><orig>qum</orig><reg>cum</reg></choice>, the EpiDoc convention for keeping both forms recoverable from the same record.
- Material analysis as scholarly contribution: pXRF candidate quarry list with the specific scholar (Alessia Coccato) credited.
- Versioned provenance: <revisionDesc> records every edit to the record, dated and signed by ORCID.
- License + DOI: explicit CC-BY 4.0, with a Zenodo DOI for citing this specific record.
If every Latin and Greek epigraphic database recorded inscriptions to this standard, the JDH paper's data-construction story would be much shorter — and macro-historical aggregation across databases would be tractable instead of artisanal.
§ 8 Sources for this case study
The five database records used
- I.Sicily ISic000470 — primary record, with linked TEI EpiDoc XML.
- ISic000470 raw TEI XML — the source of every detail in this walkthrough.
- EDR 140617
- EDCS-22000882
- PHI 175744 (IGLPalermo 139)
- PHI 140601 (IG XIV 297)
Print editions cross-referenced
- Mommsen, T. (1883). CIL X, no. 7296.
- Kaibel, G. (1890). IG XIV, no. 297.
- Manni Piraino, M. T. (1973). IGMusPalermo, no. 139.
- Bivona, L. (1970). ILMusPalermo, no. 74.
- Wilson, R. J. A. (1990). Sicily under the Roman Empire, p. 314 fig. 266.
- Tribulato, O. (2011 / 2012). On the bilingualism of this inscription.
- Consani, C. (2021). On the inscription as the work of a Latin speaker translating literally into Greek.
The methodological frame
- Heřmánková, P., Kaše, V., & Sobotková, A. (2021). Inscriptions as data: digital epigraphy in macro-historical perspective. Journal of Digital History, 1(1). doi.org/10.1515/jdh-2021-1004 — see the Paper Edition for a section-by-section walkthrough.
- Kaše, V., Sobotková, A., & Heřmánková, P. (2023). Modeling Temporal Uncertainty in Historical Datasets. CHR 2023. CEUR-WS
Companion editions
- Paper Edition: JDH 2021 walkthrough.
- Reference Edition: deep technical companion to all 37 SDAM repositories.
- Visual Edition: 19-slide interactive intro to ETL.
- Landing page: choose your starting point.