From the Stone从石头说起
Portal门户 Visual视觉版 Case Study案例 Paper论文版 Reference参考版 Database literacy数据库识读 Atlas →地图 →
A database-literacy tutorial · 数据库识读教程

How far from the stone?

我们离这块石头到底有多远?

A single Sicilian inscription appears in six places on the web. Each version remembers something different. Each forgets something different. This tutorial walks the gap between the slab in Palermo and the row in your CSV.

同一块西西里铭文在网络上有 个落点。每一个版本记得的东西都不一样,遗忘的也不一样。本教程沿着这条裂缝走,从巴勒莫的那块石板,走到你 CSV 文件里的那一行。

I.Sicily EDCS PHI ×2 EDR OSU KB EDH (absent)

Use ← → to navigate · click any underlined term for a definition · 点击任何带下划线的术语查看注释

使用 ← → 翻页 · 点击任何带下划线的术语查看注释

The orienting question · 总问题

When you analyse "500,000 inscriptions", what are you actually analysing?

当你说"分析 五十万件铭文"时,你究竟在分析什么?

Every line in your CSV is the residue of a long chain. Stone → squeeze or photograph → printed transcription (CIL, IG, AE) → re-keyed into a database → cleaned by regex → exported as JSON → loaded into pandas. Each step carries a quiet loss. The closer you get to the stone, the fewer things you can do at scale; the farther you get from it, the more you can compute and the less you actually know.

CSV 中的每一行都是一条长链的残余。石头 → 压模或摄影 → 印本转写(CIL、IG、AE)→ 重新键入数据库 → 用正则清洗 → 导出为 JSON → 加载进 pandas。每一步都有一份默默的丢失。越靠近石头,可以批量做的事越少;越远离石头,能算的越多,可你实际知道的越少。

The thesis of this tutorial: "the digital corpus" is not one thing. It is a federation of partial views, each shaped by what its makers prioritised. Knowing each view's shape is the first move in honest computational epigraphy.

本教程的命题:所谓"数字化的语料库"不是一个东西。它是一个 由若干局部视角组成的联邦,每个视角都被其制作者的优先取向所塑形。看清每个视角的形状,就是诚实的"计算铭文学"的第一步。

The demonstrative example · 示例铭文

CIL X 7296 = IG XIV 297 — the bilingual stonecutter's sign from Panhormus / Palermo

CIL X 7296 = IG XIV 297 — 一块来自 Panhormus / 巴勒莫 的双语石匠铺招牌

CIL X 7296 / IG XIV 297 — bilingual stonecutter's plaque from Palermo
The actual stone — Museo Archeologico Salinas, Palermo. Each database we visit on the next slides is reaching after this object.实物,巴勒莫 Salinas 国家考古博物馆。下面将走访的每一个数据库都是在追这块石头。

What it is

这是什么

A small marble plaque (~30 × 47 cm) once mounted at a Sicilian stonecutter's workshop, advertising its services in both Latin and Greek. Probably 1st century CE.

一块小型大理石板(约 30 × 47 厘米),曾立于一家西西里石匠铺前,用 拉丁语与希腊语 双语为店主招揽生意。年代大约公元 1 世纪。

Provenance: Palermo, Sicily. The plaque doubles as a sample of the workshop's lettering — a sign cut by the cutter, advertising the cutter.

出土:意大利西西里,巴勒莫。这块招牌同时也是店家手艺的样品,由石匠亲手刻下,给路人示范他能刻什么。

What it says

写了什么

Tituli heic ordinantur et
sculpuntur, aidibus sacreis,
qum operum publicorum.

στῆλαι ἐνθάδε τυποῦνται
καὶ χαράσσονται ναοῖς
ἱεροῖς σὺν ἐνεργείαις
δημοσίαις.

"Inscriptions are designed and cut here, [for] sacred buildings, with [those] of public works." Same content, twice, for two reading communities.

"此处可设计、雕刻铭文,[供]神庙[使用],并兼接公共工程之刻字。"内容一样,写两遍,给两批读者。

Why this stone? It is unusually well-attested — it appears in five online databases plus a university photograph repository — and unusually well-suited to the comparison, since both halves of the bilingual text foreground the act of cutting itself.

为何选这块?它在网上的存在感 异常密集:五个数据库加一座大学照片库都收录了它,且双语两端都把"刻字"这一动作摆在最前面,恰好适合用来谈"数字版本之间的差异"。

Step 1 · 第一步

What an inscription database actually is

铭文数据库 到底是什么

An inscription database is not a digital copy of a stone. It is a columnar table: rows are inscriptions, columns are decisions about what to record. Every column reflects an editorial bet — "we believe province / language / dating window / material / findspot are queryable; so we will fix a controlled vocabulary for them and ignore the rest."

铭文数据库不是石头的数字副本。它是一张 列状表格:行是铭文,列是关于"该记录什么"的决定。每一列都包含一种编辑式赌注,"我们认为省份/语言/年代窗口/材质/出土地是可查询的字段;那就给它们规定受控词表,其余的就忽略。"

What columns are good for

列结构擅长什么

  • Counting (how many inscriptions in Sicilia)
  • 计数(西西里行省共有多少条铭文)
  • Filtering (only Latin, only first century)
  • 筛选(只要拉丁文、只要公元一世纪)
  • Joining (this stone with that geography)
  • 联接(这块石头与某条地理记录配对)
  • Mapping (lat/long → dot on a map)
  • 作图(经纬度 → 地图上一个点)

What columns are bad for

列结构不擅长什么

  • Anything that varies continuously (chiseling depth, surface wear, paint traces)
  • 任何连续变化的属性(雕凿深浅、风化、留色)
  • Anything that depends on standing in the room (viewing angle, light, scale of the lettering against a body)
  • 任何"必须站在原地才能感受"的属性(观看角度、光照、字与人体的尺度对比)
  • Anything where editorial judgement disagrees (date, provenance, restorations)
  • 任何编辑判断不一致的属性(年代、出处、修复)

Autopsy — physically examining the stone — sits behind every "good" record. But autopsy doesn't scale. So databases trade depth for breadth: they record what they can encode, then move on.

实地察看(亲自去看石头)是每一条"好记录"背后的源头。但实地察看不能批量。于是数据库选择以广度换深度:能 编码 的就记下来,剩下的就略过。

This is a 200-year-old trade-off. Mommsen's CIL (1853–) was already trading editorial depth (autopsy + apparatus + bibliography) for systematic coverage of the Latin world. Modern databases inherit the trade-off and adjust the ratio. → Full lineage这是一项 200 年的取舍。Mommsen 的 CIL(1853–)已在以"编辑深度(实地察看 + 校勘 + 参考书目)"换"对拉丁世界的系统覆盖"。现代数据库继承了同一取舍,只调整比例。→ 完整谱系

Step 2 · 第二步

Six landings — where this same stone appears on the web

六个落点,同一块石头在网上的六个位置

Click each card to open that database's record in a new tab. Notice that even the URLs disagree: each project has chosen its own ID system, and there is no master index that resolves them all.

点击每张卡片在新标签页中打开该数据库的记录。注意连 URL 都互不相同:每个项目都自定义了一套 ID 系统,目前没有一个总索引能把它们全部串起来。

Plus a meaningful absence: this stone is not in the EDH, the most photo-rich Roman-provincial database. Why? Because EDH's geographical scope stops at the Roman provinces; Italy itself is technically out-of-scope. The richest database is also the wrongest one to consult here.

外加一个有意义的缺席:这块石头 不在 EDH 当中,而 EDH 是图像最丰富的罗马行省数据库。为什么?因为 EDH 的地理范围只收"行省",意大利本土不在其中。图像最丰富的那个库,恰恰是查这块石头时最不该用的库。

Side-by-side · 并排对比 · 1/3

I.Sicily vs EDCS — the two best-equipped versions

I.SicilyEDCS,两个最完备的版本

I.Sicily — ISic000470

format
TEI EpiDoc XML — every line break, every restoration, every editor's certainty marker is a structured element
image
IIIF server with deep-zoom tiles; Mirador viewer embedded
text
three layers: diplomatic, normalised, interpretive
dimensions
H 30 cm · W 47 cm · D 6 cm · letter h. 1.7–2.4 cm
geo
Pleiades 462492 (Panhormus) + present location: Museo Archeologico Salinas, Palermo
licence
CC BY 4.0; full XML on GitHub
strength
Reproducibility: every editorial decision is encoded and traceable.

EDCS — EDCS-22000882

format
flat tabular record; one row per inscription
image
one external photo field with concatenated JPG URLs (no IIIF, no zoom)
text
three layers in three fields: raw inscription, conservative_cleaning, interpretive_cleaning
dimensions
— not recorded —
geo
province "Sicilia"; place "Palermo / Panhormus"; 38.1115, 13.3516
licence
site terms; bulk dumps via partner ETL pipelines
strength
Coverage: 537,000+ records. The widest net for Latin epigraphy on the open web.

Disparity: I.Sicily knows the dimensions of the slab and lets you zoom into a single letter. EDCS records 36× more inscriptions but doesn't know how big any of them is. Both can be useful — for different questions.

差异:I.Sicily 知道这块板有多大,并允许你缩放到单个字母。EDCS 收录了多 36 倍的铭文,却不知道任何一块石头的尺寸。两者各有所长,取决于你问的是什么问题。

Side-by-side · 并排对比 · 2/3

PHI 175744 vs PHI 140601 — the same stone, twice

PHI 175744PHI 140601,同一块石头,被收两次

PHI 175744

format
text-only HTML, no metadata vocabulary
image
— none —
text
Greek face only, presented as Leiden-edited prose
attribution
references IG XIV 297 in the header
id
opaque numeric ID; no meaningful meaning to the number

PHI 140601

format
same text-only HTML
image
— none —
text
also the Greek face, but cited differently in the header (the SEG 50.1016 entry)
attribution
references SEG 50.1016 in the header
id
a different opaque ID for the same physical stone

This is normal, not pathological. PHI ingests records from many publication series (IG, SEG, AE, journal articles). When the same stone is published in two different series, PHI ingests it twice — under two PHI IDs. There is no automatic dedup. If you query "how many Greek inscriptions mention τυποῦνται" you will double-count this one.

这是常态,不是事故。PHI 从多套出版物(IG、SEG、AE、期刊论文)抓取记录。当同一块石头被两套出版物各自著录,PHI 就把它抓两次,给两个 PHI ID。没有自动去重。如果你查"希腊文铭文中提到 τυποῦνται 的有多少件",这一块会被 计两次

PHI is the canonical Greek epigraphy text database. Strong on findable text, weak on everything else. Counts done on PHI alone — without joining to a richer record — are unreliable.

PHI 是希腊铭文的标准文本数据库。它的强项是 可检索的文本,弱项是 除文本以外的一切。仅在 PHI 上做计数,不与更丰富的记录联接,结果是不可靠的。

Side-by-side · 并排对比 · 3/3

EDR vs OSU KB — an Italian database vs a single photograph

EDROSU KB,一个意大利语数据库 与 一张孤立照片

EDR — EDR140617

format
structured Italian-language record; text editor's transcription + commentary
image
one photograph (low-resolution; no IIIF)
text
Latin face + Greek face, both transcribed
extras
Italian-language editorial notes (commento), bibliography
id
EDR uses geographical numbering — id_nr=140617
scope
Italy + Italian islands. Sicily falls in.

OSU KB — handle 1811/99008

format
DSpace item; one digital object + Dublin Core metadata
image
one JPG · 3.89 MB · 3500 × 2300 px (rough autopsy substitute)
text
— no transcription —
extras
bibliographic relation field cites Dessau ILS 7680, IGRRP 503, ZPE 130 / 177, SEG 50.1016, AE 1982 / 2000 / 2011
id
UUID + handle (1811/99008); not an epigraphy ID
scope
A single photograph in a university repository — not an epigraphy database at all.

Lesson: Some "records" of an inscription are not records of inscriptions at all — they are records of photographs, kept by libraries on photo-management terms (Dublin Core, DSpace, ARK identifiers). The bibliography lives outside the photograph; the photograph lives outside the database. Joining them up requires you, the human reader, to do the work.

教训:一些"铭文记录"其实根本不是铭文记录,它们是 照片记录,被图书馆按照"图像管理"的逻辑保存(Dublin Core、DSpace、ARK 标识符)。文献列表在照片之外,照片在数据库之外。把它们串起来的工作,要由你,真人读者,自己来做。

A meaningful absence · 一个有意义的缺席

Why this Sicilian stone falls out of the most photo-rich Roman database

为什么这块西西里石头会 不在 那个图像最丰富的罗马数据库里

The Epigraphic Database Heidelberg (EDH) is the most photo-rich Latin epigraphy database, with high-resolution images and squeeze archives. But its scope is the Roman provinces — the imperial territory ruled from Rome. Italy itself, including Sicily, is technically considered not a province but the heartland, and so is normally excluded from EDH's records.

海德堡铭文数据库(EDH)是图像最丰富的拉丁铭文数据库,有高分辨率照片和压模档案。但它的 范围罗马行省:即罗马所统治的帝国领土。意大利本土(包括西西里)按惯例 不算行省,而是被视为本土核心,因此通常不被 EDH 收录。

Verified empirically. A search of the EDH bulk dump (EDH_attrs_cleaned_2023-04-26.json, 81,883 records) for the strings CIL 10, 7296, CIL X 7296, ISic000470, IG XIV 297, IG 14, 297 yields zero hits. The stone is not there.

经实证核查。在 EDH 批量导出(EDH_attrs_cleaned_2023-04-26.json,共 81,883 条记录)中搜索 CIL 10, 7296CIL X 7296ISic000470IG XIV 297IG 14, 297零结果。这块石头确实不在其中。

This is the first lesson of database literacy: the absence of a record is itself information. If you query "where are inscriptions about stonecutters?" against EDH alone, you will conclude — wrongly — that there were none in the Italian heartland. The conclusion is an artefact of EDH's scope, not of antiquity.

这是数据库识读的第一课:记录的缺席本身就是信息。如果你只在 EDH 上查"提到石匠的铭文都在哪里",你会得出,错误的,结论:"意大利本土没有这样的铭文。"这一结论不是古代的事实,而是 EDH 范围的副作用。

Step 3 · 第三步

Information disparity matrix — 12 fields × 6 databases

信息差异矩阵,12 字段 × 6 数据库

Field字段 I.Sicily EDCS PHI ×2 EDR OSU KB EDH
Latin transcription拉丁转写
Greek transcription希腊转写
Diplomatic text外交本conservativepartial
Interpretive text解释本
Photograph照片IIIFJPG URLJPGJPG (only)
Dimensions尺寸H/W/D + letter h.partial
Material材质marblelapismarmo
Findspot lat/long出土地经纬度via Pleiades38.1115, 13.3516via place
Date range年代范围−100 / +100−100 / +100verbalnumericverbal
Bibliography参考文献structuredin publication fieldheader onlystructuredfree-text
Cross-IDs (partner_link)跨库 IDvia TM, Pleiadesexplicit fieldin commento
Open licence开放授权CC BY 4.0site termssite termssite termsCC BY

Read down a column: that's what one database knows. Read across a row: that's how unevenly one fact is recorded across databases. Notice that no single database has every column filled. Compositional knowledge requires querying many at once.

沿一列竖读:单一数据库知道什么。沿一行横读:同一事实在不同数据库中被记录得多么不均。注意 没有任何一个数据库把每一格都填满。要拼出"组合性的知识",必须同时查询多个。

For the family taxonomy behind these columns:对应六大家族分类: Atlas · Six families · Atlas · Comprehensive 28-row table

Step 4 · 第四步

The image question — how many inscriptions actually have one?

图像问题,有多少铭文真的附图?

"EDCS and EDH provide images" is true at the database level. But within those databases, image coverage is sparse. We can quantify this directly from the bulk data dumps.

"EDCS 和 EDH 提供图像",在数据库层面这是对的。但在数据库内部,图像覆盖率非常稀疏。我们可以直接从批量导出数据中量化这一点。

EDCS — empirical

total
537,262 records
with photo URL
141,036 records
share
26.3 %
with partner_link
274,338 records (51.1 %)
source
EDCS_merged_cleaned_attrs_2022-09-12.json (SDAM ETL public dump)

EDH — empirical

total
81,883 records
photo field in dump
— absent —
finding
The SDAM EDH ETL did not extract image URLs. Image presence is itself information that the pipeline dropped.
implication
"Does EDH have an image?" is not answerable without going back to the EDH website record-by-record.
source
EDH_attrs_cleaned_2023-04-26.json (SDAM ETL public dump)
26.3% EDCS records with a photo URLEDCS 记录中附照片链接的比例

Three out of four EDCS records have no image at all. For 396,000 inscriptions, the database is simply text — no visual control on the editor's reading.

EDCS 中四分之三的记录 没有 任何图像。对于 39.6 万条铭文,数据库就是纯文本,编辑者的释读没有任何视觉佐证。

Step 5 · 第五步

Three states of the same text — how far from what's on the stone?

同一段文字的三种状态,离石头表面有多远?

EDCS records the inscription's text three times in three different fields. Each is a different distance from the stone's actual surface. Reading them side by side teaches you what "the text" means in epigraphy: it depends on which version you took.

EDCS 把铭文文本以三种不同方式存了三遍,分别置于三个字段中。每一种距离石头表面的远近不同。三种并排读,能学会"铭文文本"在铭文学中究竟意味着什么,取决于你拿了哪一版。

L0
Stone surface石头本身
↳ irreducible. Even the stone has been damaged, restored, repainted.
paint, wear, depth, light留色、风化、深度、光线
L1
Diplomatic transcription (raw)外交本转写(原貌)
Tituli / h{e}ic / ordinantur et / sculpuntur / aidibus(!) sacreis(!) / <c=Q>um operum(!) / publicorum(!)
retains: line breaks, errors, in-line correction marks保留:换行、错字、行内校正符号
L2
Conservative cleaning保守清洗
Tituli heic ordinantur et sculpuntur aidibus sacreis Qum operum publicorum
drops: line breaks, brackets — keeps: archaic spellings (heic, aidibus, Qum)丢弃:换行与括号,保留:古拼写(heic、aidibus、Qum)
L3
Interpretive cleaning (modernised)解释本(现代化)
Tituli hic ordinantur et sculpuntur aidibus sacreis cum operum publicorum
"heic" → "hic", "Qum" → "cum". Easier to read, further from the stone."heic" 改为 "hic","Qum" 改为 "cum"。更易读,离石头更远。
L4
Translation译文
"Inscriptions are designed and cut here…"
Loses: original morphology, archaisms, the bilingual mirroring丢失:原始形态、古风、双语对照结构

Question for the class: if you ran a regex search for the exact string "heic" across the EDCS interpretive-cleaning column, you would miss this inscription. Is that an EDCS bug, or a feature?

课堂提问:如果你在 EDCS 的"解释本"列上用正则搜 "heic" 这个原拼写,会漏掉这条记录。这是 EDCS 的错,还是它的特性?

Step 6 · 第六步

Six IDs for one stone — and the field that tries to bridge them

同一块石头的六个 ID,以及试图把它们桥接起来的那个字段

Each database invented its own ID system. To find this stone elsewhere, you have to know all of them. EDCS makes a partial attempt: in its partner_link field it writes out the inscription's identifiers in five other systems.

每个数据库都自定义了一套 ID 系统。要在别处找到这块石头,你必须 同时知道 所有这些 ID。EDCS 部分尝试解决这个问题:在它的 partner_link 字段中把铭文在其他五种系统中的标识符列出来。

EDCS-22000882.partner_link =
https://edcs.hist.uzh.ch/en/
  N140617;          → EDR 140617
  cil10_p758;       → CIL 10 page 758 (the printed CIL volume)
  P140601;          → PHI 140601
  is000470;         → I.Sicily ISic000470
  t2016.031.04;     → Tyche 2016 article
  de-b0810;         → DAI Berlin photo archive
  oh_99008;         → OSU Knowledge Bank handle 1811/99008
  bu_CIL&volume=10&insc=7296   → Berlin CIL backend ref

What this field does not do:

这个字段 做的事:

  • It does not link to PHI 175744 (the second PHI record).
  • 没有 指向 PHI 175744(即 PHI 中的另一条记录)。
  • It is hand-curated; new partner records added later won't appear unless someone updates it.
  • 它是人工维护的;后来加入的合作记录不会自动出现,除非有人更新。
  • It is asymmetric: the OSU KB record's metadata does not link back to EDCS. Each project decides on its own whether to maintain partner links.
  • 它是不对称的:OSU KB 记录的元数据 没有 反向链接到 EDCS。是否维护合作链接由每个项目自决。

Trismegistos aspires to be the universal join key — a single TM-ID that resolves across all databases — but coverage is uneven and not every record carries one. EAGLE was meant to harmonise; in practice each federation member kept their conventions.

Trismegistos 试图扮演"通用联接键"的角色,一个能在所有数据库间解析的 TM-ID,但覆盖参差不齐,并非每条记录都有。EAGLE 联盟原本希望统一,实际上每个成员都保留了自己的旧惯例。

Step 7 · 第七步

The bibliography cascade — how citations move further from the stone

参考书目的级联,引用链如何越走越远

EDCS records this stone's publication field as a single string of equal-signs:

EDCS 把这块石头的 publication 字段记录为一连串以等号串联的引用:

CIL 10, 07296 = D 07680 = IG-14, 00297 = IGRRP-01, 00503
   = ZPE-130-239 = ZPE-177-131 = SEG-50, 01016
   = AE 1982, 00417 = AE 1999, 01543 = AE 2000, 00643
   = AE 2011, +00437 = AE 2012, +00014 = ZPE-205-145
   = Tyche-2016-54 = AE 2018, +00130 = AE 2018, +00756

That is the publication history of one stone, compressed. Reading top to bottom is reading forward through 130+ years of scholarship:

这是一块石头的出版史的压缩版。从上往下读,就是顺着 130 多年的学术史前进:

Citation引用 What it is是什么 Date年代 Distance from stone距石头
CIL 10, 7296CIL volume 10 — Mommsen's printed Latin transcription1883closest (autopsy)
IG XIV, 297IG volume 14 — Kaibel's printed Greek transcription1890closest (autopsy)
D 7680Dessau, ILS — selected reprint with light commentary1892–1916+1 step
IGRRP I, 503Cagnat, Inscriptiones Graecae ad Res Romanas Pertinentes1911+1 step
ZPE 130, 239Kruschwitz article — sprachliche Anomalien2000discussion only
SEG 50, 1016Supplementum Epigraphicum Graecum — citation aggregator2003discussion only
ZPE 177, 131Tribulato article — bilingual stonecutter sign2011discussion only
AE 2018, 130L'Année Épigraphique — yearly citation roundup2018citation about citation

Most modern citations refer to the inscription via previous editions, not via fresh autopsy. By 2018, an article citing AE 2018 §130 is two or three steps removed from anyone who actually saw the stone.

绝大多数现代引用都是 经由 前人的版本来指这块石头,而非通过新一次的实地察看。到了 2018 年,一篇引 AE 2018 §130 的论文,与任何"亲眼见过石头的人"已隔了两到三层。

Step 8 · 第八步

The geography of databases — why a Sicilian stone needs five

数据库的地理学,一块西西里石头为何需要五个数据库

No epigraphic database covers everything. Each carved out a domain — by region, by language, by script, by period. Where a stone happens to land determines who notices it.

没有任何一个铭文数据库覆盖一切。每个都依据地区、语言、文字、时段划出自己的范围。一块石头落在哪里,决定了谁会注意到它。

Database数据库 Region/scope范围 Language语言 Approx. records记录数 Includes our stone?收本案?
EDHRoman provinces (excl. Italy)Latin (mostly)~82,000no — Italy is out of scope
EDCSany Latin inscriptionLatin (any)~537,000yes
PHIany Greek inscriptionGreek~250,000yes (twice)
EDRItaly + Italian islandsLatin (Italian-language metadata)~80,000yes
I.SicilySicily onlyany (Greek + Latin + Punic + Hebrew)~3,500yes
OSU KB / CEPSOSU's photographic archiveany (photo only)n/ayes (one photo)

The implication for queries: a question like "how often do Sicilian inscriptions mention public works?" is technically about all inscriptions found on Sicily — but no single database has them all. EDH is silent. EDCS has Latin Sicilian inscriptions but loses the bilingual halves. PHI has the Greek halves but loses the dating metadata. I.Sicily has both halves of some stones but only ~3,500 records total. The honest answer requires federating all of them — and accepting that joins will be partial.

对查询的含义:"西西里铭文中提及公共工程的频率有多高?",严格说这是一个关于 所有 西西里出土铭文的问题,但没有哪个单一库收齐了全部。EDH 不记。EDCS 有拉丁面,但希腊面被剥落。PHI 有希腊面,但年代元数据丢失。I.Sicily 二者皆有,但只覆盖约 3,500 条。诚实的答案要求 联邦化 这些库,并接受联接必定有缺口。

Step 9 · 第九步

The information-loss ladder — stone to CSV in seven steps

信息丢失阶梯,从石头到 CSV 的七级

1
Stone in the ground原石(出土前)
— still in archaeological context尚有考古语境
2
Excavated & moved to museum出土后搬入博物馆
loses: surroundings, orientation, depth in stratum丢失:周边、朝向、地层深度
3
Autopsy + squeeze / photograph实地察看压模/摄影
loses: third dimension, paint, surface oils, parallax丢失:第三维、留色、表面油脂、视差
4
Printed transcription (CIL, IG)印本转写(CIL、IG)
loses: visual field; gains: editorial annotations, brackets, conjectures丢失:视觉信息;获得:编辑注释、括号、猜测
5
Re-keyed into a database (EDH, EDCS, PHI, EDR)重新键入数据库(EDH、EDCS、PHI、EDR)
loses: editorial brackets, page layout; adds: structured fields (date range, geo)丢失:编辑括号、原版面;增加:结构化字段(年代范围、地理)
6
Cleaned via regex (SDAM ETL pipeline)正则清洗(SDAM ETL 管道)
loses: restorations, marker characters; produces: conservative_cleaning + interpretive_cleaning丢失:修复、标记符;产出:conservative_cleaning + interpretive_cleaning
7
Loaded into pandas / parquet for analysis加载进 pandas / parquet 做分析
treated as: a row equal to all other rows. Variation in evidentiary quality is invisible.被处理为:一行,与其他行等价。证据质量差异不可见。

Where you stand on the ladder dictates what you can see. A philological question (is "Qum" a graphical archaism for "cum"?) needs Step 4. A geo-statistical question (where do bilingual workshops cluster?) needs Step 7. Stepping up the ladder for a question it isn't built for produces wrong answers without warning.

你站在阶梯的哪一级,决定了你能看见什么。语文学问题("Qum"是否"cum"的字形古风?)需要第 4 级。地理统计问题(双语作坊在哪聚集?)需要第 7 级。把不该上层的问题上推一级,会得到错答,而且不会有警告。

Step 10 · 第十步

The circulation problem — when a row gets cited as a stone

流通问题,当一行数据被当成石头来引用

Once an inscription becomes a row in EDCS or a TEI XML in I.Sicily, the row begins to circulate independently. It gets quoted in journal articles, mapped onto interactive viewers, ingested by ML training pipelines, and counted in distant-reading studies. At each step, the row carries less and less of the chain that produced it.

一块铭文一旦在 EDCS 中变成一行、或在 I.Sicily 中变成一份 TEI XML,这一行就开始独立流通。它被期刊论文引用,被嵌入交互式查看器,被机器学习训练管道吞下,被"远距阅读"统计研究计数。每一次流通,行内携带的"产生这行的那条链"就少一点。

Honest circulation

诚实的流通

  • Carries: a citation, a date, an editor's identity, a confidence flag
  • 携带:引用、年代、编辑者身份、置信标记
  • Treats partial coverage as itself a finding, not a flaw to hide
  • 把覆盖率不全本身 视作发现,而不是要遮掩的缺陷
  • Re-derives from primary records, not from prior derivative work
  • 从原始记录重新推导,而非沿袭前人衍生作品

Detached circulation

脱钩的流通

  • Carries: a number. The provenance is a footnote, often dropped
  • 携带:一个数字。出处只是脚注,常被略去
  • Treats database disagreements as noise to average out
  • 把数据库间的分歧当作"噪声"来平均掉
  • Re-circulates the same numbers in successive papers, each citing the previous
  • 同一组数字在论文间反复流通,每篇都引前一篇

A test: the next time you read a sentence like "There are X bilingual inscriptions from Roman Sicily", ask: which database produced X? Which fields did it count? Was the duplication into PHI 175744 / 140601 deduplicated? Did the count include I.Sicily's small set, or only EDCS? If the article doesn't say, the number is detached.

一个测试:下次读到 "罗马时期西西里岛共有 X 件双语铭文" 这样的句子,请追问:X 来自哪个数据库?数了哪些字段?PHI 175744 / 140601 这种重复条目去重了吗?是否纳入 I.Sicily 的小集?还是只用 EDCS?如果文章没说,这个数字就是脱钩的。

Step 11 · 第十一步

Five things you cannot recover from any database row

任何 数据库的一行中都恢复不了的五样东西

1
Chiseling depth and angle.雕凿深度与角度。 Whether the cuts are V-section (formal commission) or U-section (cheaper, faster). The economics of the workshop is invisible.凿刻是 V 形断面(正式订作)还是 U 形(廉价快做)。作坊经济结构在数据库里完全不可见。
2
Paint traces.留色痕迹。 Most ancient inscriptions were painted (red ochre, sometimes black). Almost none of this survives in databases.大多数古代铭文原本上过色(多为红赭,少量为黑)。这一信息几乎完全不在数据库中。
3
Surface wear and weathering.表面磨损与风化。 Whether you can read the stone today depends on weather, micro-climate, museum lighting, polishing. The text we have is what survived; we don't know how much we lost.今天能否读出这段文字,取决于天气、微气候、博物馆光线、抛光。我们读到的是"幸存的",但不知道丢了多少。
4
Viewing angle and scale.观看角度与尺度。 A 30 cm plaque mounted at eye level on a workshop wall is read very differently from one curated under a vitrine. Embodied scale shapes meaning.挂在作坊墙上、与人眼平齐的 30 厘米石板,与陈列在展柜中的同一块石板,被阅读的方式天差地别。"身体尺度"塑造意义。
5
Context — who else was nearby.语境,谁与它相邻。 A workshop sign next to a temple inscription means something different than a workshop sign next to a graffito. Spatial neighbours carry meaning that no row can encode.作坊招牌挂在神庙铭文旁、与挂在涂鸦旁,意义完全不同。空间邻居所携带的意义,无法被任何一行编码。

These are not gaps that better column choices would fix. They are the irreducible cost of trading depth for breadth. The honest digital epigrapher names them, then proceeds.

这些不是"加几列就能补上"的缺口,而是"以广度换深度"的不可化约成本。诚实的数字铭文学者会先命名它们,再继续工作。

Step 12 · 第十二步

There is no the digital version — there are six, slightly disagreeing

没有 那一 个数字版本,只有六个互有出入的版本

If you ask each of the six landings "what does this stone say in Latin?", you get six slightly different answers. Mostly the differences are typographic — punctuation, line breaks, whether brackets are visible — but a few are substantive.

如果你向六个落点分别问"这块石头的拉丁面写了什么?",你会得到六个稍有出入的答案。多数差异是排版性的,标点、换行、是否保留方括号,但少数是实质性的。

Source来源 Line 1 of Latin face拉丁面第一行 Treats "heic" as如何处理 "heic" Treats "Qum" as如何处理 "Qum"
I.Sicily (diplomatic)Tituli h{e}icpreserved with editorial markerspreserved with marker
I.Sicily (interpretive)Tituli hicnormalised to "hic"normalised to "cum"
EDCS (raw inscription)Tituli / h{e}icpreservedpreserved with <c=Q>
EDCS (conservative)Tituli heic"heic" preserved"Qum" preserved
EDCS (interpretive)Tituli hicnormalisednormalised
EDR (Italian record)Tituli heicpreservedpreserved as "Qum"
OSU KB (no transcription)— (only photograph) —n/an/a

The lesson: when you see "the EDCS text of CIL X 7296" in a paper, ask which EDCS text. There are at least three. The choice between them is editorial — and so are the conclusions you can draw from a search regex.

结论:当你在论文里读到"EDCS 中 CIL X 7296 的文本"时,请追问 哪一版 EDCS 文本。至少有三版。三选一是编辑判断,你能从一条正则得出的结论,也是。

Step 13 · 第十三步

Why the JDH 2021 method is honest about all of this

为何 JDH 2021 的方法 对此保持诚实

Heřmánková, Kaše & Sobotková (JDH 2021) — the methodological paper behind the SDAM ETL pipelines — articulates three commitments that respond directly to the problems on the previous slides:

Heřmánková、Kaše 与 Sobotková(JDH 2021),SDAM ETL 管道背后的方法论论文,提出三项承诺,正面回应上述问题:

Reproducibility

可复现

Every figure must be re-derivable from the same code and data. So readers can examine where database choices show up in conclusions.

每张图必须能由同一份代码与数据重新生成。让读者看清"数据库选择"在结论中如何显形。

Honest uncertainty

诚实的不确定性

Dating uncertainty must be propagated, not hidden. (Hence tempun Monte Carlo simulation over date ranges, instead of point estimates.)

日期不确定性必须向下传播,不能掩盖。(因此用 tempun 蒙特卡洛模拟来跑年代区间,而非点估计。)

Bias is content

偏差也是内容

Disagreement between EDH and EDCS is not noise to be averaged out — it's scholarly evidence about how the database was made.

EDH 与 EDCS 之间的差异不是要平均掉的噪声,它是关于 数据库是如何被造出来的 的学术证据。

Read the three Stance cards in Paper Edition § 7 for the original framing. The case study in Case Study Edition walks the same problem in seven concrete issues. The Databases Atlas classifies all the projects discussed here into six families.

原文表述见 论文版 § 7 的三张"立场"卡。案例版 把同一组问题以七个具体的"问题号"逐一展开。数据库地图 把这里讨论的所有项目归入六大家族。

Hands-on · 动手做

Try it on your own — three short exercises

动手做,三个小练习

Exercise 1

Pick any inscription from EDH. Now try to find the same inscription in EDCS, PHI, EDR. How many of the four hold a record? How many hold an image?

EDH 中任挑一条铭文。然后试着在 EDCS、PHI、EDR 中分别找到这同一块。四个库中有几个收录?几个有图?

Exercise 2

In EDCS's bulk dump, run regex search "heic" against (a) the raw inscription field, then (b) the interpretive_cleaning field. How many hits each? Why the difference?

在 EDCS 批量数据上分别对 (a) 原始 inscription 字段、(b) interpretive_cleaning 字段做正则搜索 "heic"各得几个命中?差异从何而来?

Exercise 3

Write a short paragraph (≤200 words) on what is lost by collapsing CIL X 7296's diplomatic, conservative, and interpretive transcriptions into one column for downstream analysis. Cite the EDCS field names you would prefer to keep separate, and why.

写一段 ≤200 字的短文,论 CIL X 7296 的"外交本/保守清洗/解释本"如果合并成一列做下游分析,会丢失什么。点名你希望保留为独立字段的 EDCS 列名,并说明理由。

Where to do this: Reference Edition § 4.5 (link) shows the actual SDAM Python loops you'd modify to run Exercise 2. The Case Study Issue 2 page has the side-by-side text fields for Exercise 3.

在哪做:参考版 § 4.5(链接)有真实的 SDAM Python 循环代码,可改写来跑练习 2。案例版 · 问题 2 已并排呈现练习 3 所需的文本字段。

Recap · 收束

Three sentences to take away

三句话带回家

  1. A database is not a digital copy of a stone — it is a set of editorial bets about which fields to record. Knowing the bets is the first move in honest analysis.
  2. 数据库不是石头的数字副本,它是 关于"该记哪些字段"的一组编辑式赌注。看清这些赌注,是诚实分析的第一步。
  3. No single database covers everything. Even where one stone appears in five, the five disagree quietly, and a sixth (a photo repository) doesn't transcribe at all.
  4. 没有任何一个数据库能覆盖一切。即便一块石头同时出现在 五个 库中,五者之间也存在静默分歧;第六个(照片库)甚至根本没有转写。
  5. Database disagreement is not a flaw to hide — it is evidence about how the data was made, and it is the most honest part of the corpus.
  6. 数据库间的分歧不是要遮掩的缺陷,它是 关于数据如何被制造 的证据,也是整套语料库中最诚实的那部分。

Where next: the Case Study walks one inscription through seven concrete issues. The Visual edition introduces the SDAM ETL pipelines that produce the cleaned data. The Reference covers all 37 SDAM repositories. The Paper walks the JDH 2021 article that proposes the methodology.

下一站:案例版 把一块铭文沿着七个具体问题走一遍。视觉版 介绍 SDAM ETL 管道。参考版 覆盖 SDAM 全部 37 个仓库。论文版 沿着 JDH 2021 一文走它提出的方法论。

Sources · 资料来源

Six landings for one stone

同一块石头的六个落点

Bulk data sources

批量数据来源

All counts and field-level findings on the previous slides were computed directly against the bulk data dumps in the SDAM repository — not summarised from secondary sources.

前面页中的所有计数与字段级发现,均直接从 SDAM 仓库中的批量数据导出文件计算得来,并非援引二手资料。