SDAM ETL Paper: Reading Guide

Inscriptions as data

Petra Heřmánková · Vojtěch Kaše · Adéla Sobotková
Heřmánková, P., Kaše, V., & Sobotková, A. (2021). Inscriptions as data: digital epigraphy in macro-historical perspective. Journal of Digital History, 1(1). doi.org/10.1515/jdh-2021-1004
Abstract (verbatim)

As short texts written on a durable medium, inscriptions represent invaluable insight into past societies, their organization, cultural norms and practices. Several hundred thousand inscriptions in Greek and Latin language survived until today, providing us with a line of evidence concerning populations of large cities and rural communities of the entire Mediterranean Basin in the period between the eighth century BC and eighth century AD. Although published inscriptions have been near-completely digitized and exist in online databases, and open computational tools exist to handle large datasets, large-scale and comparative studies of inscriptions are still rare. Numerous technical and conceptual issues, such as the inconsistent resolution of spatial and temporal attributes or the incompatibility of data structures between datasets, hinder the aggregation and analysis of thousands of inscriptions. The incomplete, uncertain and complex nature of inscriptions as a historical source required us to develop a series of custom open-source tools and reproducible pipelines, enabling a macro-scale overview of epigraphic production in time and space. To illustrate the potential of quantitative studies in epigraphy, we harvest and render comparable two well-established, yet very distinct, digital collections for Latin epigraphy: Epigraphic Database Heidelberg (EDH), containing over 81,000 records and Epigraphische Datenbank Clauss-Slaby (EDCS) with over 500,000 records. Placing the datasets side-by-side, we contrast past interpretations of epigraphic habit based on limited samples with trends derived from all available data and discuss their strengths and shortcomings of each respective dataset. We assert that research communities stand to gain from extending digital infrastructures to reduce barriers to access with packages of open and reusable research tools.

§ 1 The article in one paragraph

The paper takes the two largest digital corpora of Latin inscriptions — EDH (≈81,000 records, peer-reviewed and richly annotated) and EDCS (≈540,000 records, larger but more uneven) — harmonizes them through a transparent, reproducible ETL pipeline, and uses the result to revisit Ramsay MacMullen's classic 1982 thesis of an "epigraphic habit" that rose with the early Empire and collapsed in late antiquity. The empirical findings are largely conservative — the rise-and-fall pattern is real — but the methodological argument is bolder: digital epigraphy can only do macro-history responsibly if dataset construction is itself a scholarly, citable, reproducible artifact rather than an undocumented black box.

§ 1.5 The scholarly lineage this paper inherits and improves

Before reading the article on its own terms, it helps to see where it stands. Heřmánková, Kaše & Sobotková do not arrive at digital epigraphy fresh: they enter a discipline with a 200-year print tradition, a 40-year computational lineage, and a still-emerging consensus on what reproducible scholarship should look like. This section names the three streams the paper draws on and the specific gap it tries to close.

A useful image to keep in mind throughout this section: CIL X 7296 / IG XIV 297, a bilingual stonecutter's sign from Roman Palermo. Mommsen (1883) and Kaibel (1890) edited it for print. Five digital databases now hold it. The JDH 2021 paper's whole methodological argument can be tested by asking what each layer of this lineage retained and what it dropped about this stone.

Latin and Greek epigraphy as a modern discipline begins with the great Berlin corpus projects. The Corpus Inscriptionum Graecarum (August Boeckh, 1828–) and the Corpus Inscriptionum Latinarum (Theodor Mommsen, 1853–) committed European classical scholarship to the principle that every ancient inscription should be re-edited, dated, contextualised, and assigned a stable citation. CIL is still the master reference for Latin epigraphy 170 years after its first volume.

Three generations of scholars built this tradition

  • The founders (mid-19th century): Boeckh, Mommsen, Kaibel, Dessau. Built the master corpora; established autopsy + apparatus criticus + bibliography as the editorial standard.
  • The mid-century synthesizers (1920s–1970s): Louis Robert (Paris), Ronald Syme, A. H. M. Jones, Joyce M. Reynolds (Cambridge). Wrote the analytical literature that made inscriptions usable for ancient social history; ran the annual round-ups (BE, AE, SEG).
  • The macro-historians (1980s–1990s): Ramsay MacMullen (Yale), Werner Eck, Géza Alföldy, Greg Woolf. Asked corpus-scale questions ("did inscribing rise then fall?") that demanded counting. The 1982 MacMullen article — which Heřmánková et al. revisit — is the seminal corpus-statistics paper.

The print tradition's quiet assumption: that the printed corpus is itself the data. To do statistics on the field, one counted CIL volumes, AE entries, SEG numbers — and accepted that one's totals were partial, but at least made of citable, autopsy-grounded readings.

§ 1.5.2 The digital turn (1980s to today)

Digital epigraphy is roughly 40 years old. It moved through four overlapping waves, each of which solved one problem and revealed the next:

Wave 1 · 1985–2000 · Searchable text on CD-ROM
PHI Greek Inscriptions; PHI Latin; the TLG
The Packard Humanities Institute distributed Greek inscriptions and the entire Latin corpus on CD-ROM with a basic search engine. For the first time, a scholar could ask "how often does X appear in Greek inscriptions?" and get an answer in seconds. What it solved: universal text search. What it revealed: there is no metadata structure; you can find the word but not the stone.
Wave 2 · 1995–2010 · Web databases
EDH (Heidelberg); EDCS (Clauss-Slaby); EDR; EDB
Each major centre built its own web-based database, with structured fields (place, date, material, type), Latin-and-Greek text fields, occasionally a photograph. What it solved: queryable metadata at scale. What it revealed: every project chose its own conventions; no two databases agree on what a record should contain or how identifiers should resolve. The federation problem.
Wave 3 · 2005–today · TEI EpiDoc + IIIF
I.Sicily, IRT, IRCyr, IGCyr, IAph, IOSPE, AshLI, MAMA
Driven by Gabriel Bodard, Hugh Cayless, Charlotte Roueché, Tom Elliott and the EpiDoc Collaborative. The community settled on TEI XML + Leiden-conventions-as-tags as a shared encoding standard. IIIF added deep-zoom imagery. What it solved: editorial markup is now machine-readable; cross-project interoperability becomes possible in principle. What it revealed: deep encoding takes editorial labour that does not scale; aggregator databases (EDCS, PHI) remained outside this standard.
Wave 4 · 2015–today · Open data + reproducibility
SDAM, papyri.info, Trismegistos, Ithaca / iPHI
As open-data norms spread (FAIR, GitHub, CC licences, persistent identifiers, Zenodo DOIs), digital epigraphy became answerable to computational reproducibility — the version of openness pioneered by physics + biology. The methodological question shifted: not "do we have the data?" but "can someone re-run our analysis and get our numbers?" What it solved: the dataset itself becomes a citable scholarly object. What it revealed: the methodological vocabulary of computational science (regression, propagation of uncertainty, sensitivity analysis) had not yet entered classical-studies practice.

The point: Heřmánková, Kaše & Sobotková arrive at the cusp of Wave 3 ↔ Wave 4. They use the encoding work of Wave 3 (TEI EpiDoc, structured EDH metadata) and explicitly bring the methodological commitments of Wave 4 (Monte Carlo, ETL transparency, parquet open distribution) into Latin epigraphy.

§ 1.5.3 Where Heřmánková, Kaše & Sobotková improve the scholarship

The paper's specific contribution is sharper than "we redid MacMullen with bigger data." It is a methodological synthesis. Five concrete moves the authors make that did not exist together before:

| Move | What it does | Inheriting from | Improving on |
|---|---|---|---|
| 1. Open ETL | Every step of cleaning is a code commit | Wave 4 (FAIR, software-engineering standards) | EDCS-only workflows where data origin is opaque |
| 2. tempun Monte Carlo | Date uncertainty is sampled, not collapsed | Bayesian dating in archaeology (Bronk Ramsey, Buck) | Earlier counting that pinned each inscription to a single midpoint year |
| 3. Two-database honest comparison | EDH and EDCSx counted side-by-side; the gap is the finding | Open-source benchmark culture (Kaggle, MLPerf) | Studies that pick one database and report a single number |
| 4. Pleiades georeferencing | Every record has a stable Pleiades URI for its place | Ancient World Mapping Center / Pelagios | Free-text place names that don't join across datasets |
| 5. Methodological reflection in the journal medium | JDH 3-layer format makes method visible | Replication / pre-registration movement | Conventional articles where method is a paragraph |

In one sentence

The improvement is not that the answers are different from MacMullen's — they mostly aren't — but that the path from question to answer is now itself a checkable scholarly object. Earlier macro-history asked you to trust the historian's count; this paper asks you to re-run their pipeline.

Adjacent contemporaries who matter: Roger Bagnall (papyrology, scale, demographic estimation), John Bodel (US Epigraphy Project, methodological reflection), Anne Mahoney and Gabriel Bodard (Stoa, EpiDoc), Sebastian Heath, Hugh Cayless. The JDH 2021 paper sits in this conversation, not above it.

§ 2 The JDH three-layer format

Before the article's content, a word about its form. This paper appeared in the inaugural issue of the Journal of Digital History, which uses a deliberate three-layer publication format:

Layer 1 · Narrative
The article you read
Prose, figures, citations — what humanists recognize as a journal article. This walkthrough is mostly about this layer.
Layer 2 · Hermeneutic
Methodological reflection
The authors comment in parallel on how they did the work — what choices they made and why. This is what makes JDH unusual: a structured space to examine method as it unfolds.
Layer 3 · Data
Executable Jupyter notebooks
The actual code that produced every figure and number. You can re-run it, change a parameter, and get a different chart. The code lives in the sdam-au/digital_epigraphy repository.

The three-layer format is not decoration — it is part of the argument the article makes. If the paper's claim is that data construction must itself be transparent and reusable, then the journal article must be a kind of object you can inspect, re-run, and modify. The medium enacts the message.

§ 3 The research questions

The paper interleaves two questions — one empirical, one methodological — and the second is more important.

Empirical question
What does the long-term temporal, geographic, and typological distribution of Latin inscriptions look like at full corpus scale?

Concretely: does MacMullen's "epigraphic habit" curve — rise in the early Empire, peak in the late 2nd / early 3rd c. CE, sharp decline through Late Antiquity — survive when computed not from a hand-picked sample but from every inscription EDH and EDCS hold?

The answer the paper gives is: yes, broadly. The shape is robust to switching between EDH and EDCS, robust to using midpoint vs probabilistic dating, and robust to filtering to only well-located, well-dated inscriptions. But the amplitude and the tails differ between datasets, and that difference is itself informative — it reveals the editorial fingerprints of each database.

Methodological question
What does it mean, epistemologically, to treat humanistic source material as data?

Inscriptions are not raw observations of antiquity. They were carved, found, transcribed, edited, restored, dated, indexed, digitized, and finally exposed via API or HTML — every link in that chain is an interpretive act. What does it cost to forget that?

The paper's answer: macro-historical claims about ancient inscriptions are only as good as the documentation of how the underlying dataset was constructed. The paper therefore commits an unusual amount of space to the data-construction story — the ETL pipelines, the probabilistic dating method, the choice to keep originals alongside cleaned values — and treats that documentation as a contribution co-equal with the substantive findings.

§ 4 Two databases, side by side

The paper does not treat EDH and EDCS as interchangeable suppliers of "Latin inscriptions." It treats them as different scholarly artifacts with different editorial commitments, different scope, and therefore different blind spots.

EDH · Heidelberg

peer-reviewed · richly annotated
~81k inscriptions

Curated by a team at Heidelberg University. Each inscription is editorially reviewed, dated against published scholarship, georeferenced via the Pleiades gazetteer, and exposed through a public JSON API and downloadable EpiDoc XML. (EDH belongs to Family 1 — aggregators; for the broader six-family map see the Atlas.)

  • Strong on western provinces (Italy, Gaul, Germania, Pannonia)
  • High data quality, low quantity
  • Easy to script against

EDCS · Clauss/Slaby

comprehensive · light editorial
~540k inscriptions

Compiled and maintained by Manfred Clauss and Anne Kolb. Aspires to ingest everything published in the major Latin-epigraphy print corpora (CIL, AE, etc.). Larger but with less per-record curation, and exposed only as a public-search HTML interface — no API.

  • Maximum coverage including the eastern Empire
  • Inconsistent dating granularity, with many date ranges spanning centuries
  • Must be scraped (Lat Epig 2.0 takes 4–5 hours)

A third entity appears in the paper's figures: EDCSx. This is the authors' filtered subset of EDCS, restricted to inscriptions that are dated and located precisely enough to compare meaningfully with EDH. The paper's most carefully argued figures use EDCSx, not raw EDCS — because applying EDH-quality filters to EDCS makes the two corpora actually comparable.

Why this matters. The choice between "use EDCS as-is to maximize sample size" and "filter to EDCSx to ensure comparability" is exactly the kind of methodological decision the paper wants to expose, not hide. By documenting both, it lets the reader see how the conclusions depend on that choice.
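
The filtering idea behind EDCSx can be sketched in a few lines of pandas. This is a minimal illustration, not the authors' actual filter: the column names (`not_before`, `not_after`, `latitude`, `longitude`), the toy records, and the 200-year width threshold are all assumptions chosen for the example.

```python
import pandas as pd

# Toy records standing in for scraped EDCS rows (all values are made up).
edcs = pd.DataFrame({
    "id": ["A", "B", "C", "D"],
    "not_before": [101, None, 1, 150],
    "not_after": [200, None, 600, 170],
    "latitude": [41.9, 45.4, None, 48.2],
    "longitude": [12.5, 9.2, None, 16.4],
})

def filter_comparable(df, max_span_years=200):
    """Keep rows with both date bounds, coordinates, and a date range
    no wider than max_span_years (a hypothetical threshold)."""
    dated = df["not_before"].notna() & df["not_after"].notna()
    located = df["latitude"].notna() & df["longitude"].notna()
    # Comparisons against missing values evaluate to False, so undated rows drop out here too.
    narrow = (df["not_after"] - df["not_before"]) <= max_span_years
    return df[dated & located & narrow]

edcsx = filter_comparable(edcs)
print(list(edcsx["id"]))
```

Changing `max_span_years` and re-running the downstream counts is one simple way to see how much a conclusion depends on the filtering decision.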

§ 5 From databases to data

Roughly half of the article is devoted to how EDH and EDCS were turned into a single comparable analysis-ready table. This is unusual for a humanities paper and is the article's central methodological move.

A historical note. The paper's data is the latest layer in a transcription chain that began in the 18th century: physical stones → handwritten transcripts (Torremuzza, Ignarra) → 19th-c. print corpora (CIL Mommsen 1883, IG Kaibel 1890) → 20th-c. revisions → modern digital databases → SDAM cleaning. Every layer kept some choices and discarded others. The case study walks through one inscription's full chain.

The four moves

  1. Extract. Walk EDH's API; scrape EDCS province by province with Lat Epig 2.0; harvest the EpiDoc XML dumps to recover dating prose the API has flattened.
  2. Transform. Standardize dates as integer ranges; resolve places against Pleiades; clean inscription text in two variants (interpretive + conservative); harmonize incompatible inscription-type taxonomies.
  3. Load. Publish two artifacts on Zenodo with separate DOIs — one for the dataset, one for the scripts — and mirror them to a public sciencedata.dk folder for unauthenticated read access.
  4. Date probabilistically. Treat each inscription's date range as a probability distribution. Draw thousands of Monte Carlo samples per record. Aggregate. This is what later becomes the tempun package.
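
The "two variants" idea in the Transform step can be illustrated with a small sketch. It assumes a simplified reading of the Leiden conventions, in which square brackets mark text restored by the editor and parentheses mark expanded abbreviations. The function names are hypothetical and the regular expressions cover only these two sigla; this is not the SDAM cleaning script.

```python
import re

def clean_conservative(text):
    """Keep only what is legible on the stone: drop editorial
    restorations [...] and abbreviation expansions (...)."""
    text = re.sub(r"\[[^\]]*\]", "", text)   # remove restored text
    text = re.sub(r"\([^)]*\)", "", text)    # remove expanded abbreviations
    return re.sub(r"\s+", " ", text).strip()

def clean_interpretive(text):
    """Keep the editor's restorations and expansions, dropping only the bracket symbols."""
    text = re.sub(r"[\[\]()]", "", text)
    return re.sub(r"\s+", " ", text).strip()

raw = "D(is) M(anibus) [sacr]um"
print(clean_conservative(raw))   # D M um
print(clean_interpretive(raw))   # Dis Manibus sacrum
```

Keeping both outputs next to the raw string lets a later analysis choose between "what the stone shows" and "what the editor reads" without re-running the extraction.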

§ 6 Findings: the seven figures, walked through

The article's substantive analysis is carried by seven figures. Each is reproduced from the companion notebook and described below in stylized form.

Fig 1 · The shape of the epigraphic habit

A single curve showing how many Latin inscriptions are dated to each year, summed across the entire corpus and weighted by probabilistic dating. The shape is the central empirical object of the paper.

[Figure 1: line chart. x-axis: 200 BCE to 800 CE; y-axis: inscriptions per year; annotated peak in the late 2nd / early 3rd c.]
FIG 1 Stylized reconstruction of the paper's central temporal curve. Inscription production rises through the late Republic, peaks in the late 2nd / early 3rd c. CE, and declines steeply through Late Antiquity. The shape is robust across both EDH and EDCS, but the absolute counts differ.
Key claim
MacMullen's "epigraphic habit" pattern is real, even at full corpus scale.

The rise-peak-fall shape is not an artifact of the small samples MacMullen worked with in 1982 — it persists across all 600k+ Latin inscriptions in EDH and EDCS combined.

However: the amplitude of the curve depends on what counts as "an inscription dated to year X". Probabilistic dating spreads each range over its full span, producing smoother curves than midpoint dating. The peak height itself is therefore not a single number — it's a probability distribution. This is exactly why tempun exists.
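
To make "the peak height is a probability distribution" concrete, here is a minimal Monte Carlo sketch in NumPy. It is not the tempun API: the function name `simulate_counts`, the toy date ranges, the 50-year bins, and the assumption that every year within a range is equally likely are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy date ranges (not_before, not_after) in years CE; values are made up.
ranges = np.array([[101, 200], [150, 250], [1, 300], [193, 211], [171, 230]])

def simulate_counts(ranges, n_sims, bin_edges, rng):
    """Return an (n_sims, n_bins) array. In each simulation every inscription
    gets one year drawn uniformly from its range, then the years are binned."""
    lows, highs = ranges[:, 0], ranges[:, 1]
    # endpoint=True makes the upper bound of each range a possible draw
    years = rng.integers(lows, highs, size=(n_sims, len(ranges)), endpoint=True)
    return np.stack([np.histogram(row, bins=bin_edges)[0] for row in years])

bin_edges = np.arange(0, 301, 50)        # six 50-year bins covering years 0 to 300
counts = simulate_counts(ranges, n_sims=1000, bin_edges=bin_edges, rng=rng)

mean_curve = counts.mean(axis=0)         # average count per bin across simulations
peak_heights = counts.max(axis=1)        # one peak height per simulation
print(mean_curve)
print(np.percentile(peak_heights, [5, 50, 95]))
```

Each row of `counts` is one plausible version of the curve; reporting a percentile band over `peak_heights` instead of a single maximum is the sense in which the peak becomes a distribution.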

Fig 2 · Inscription types over time

Same temporal axis, but now broken down by kind of inscription: epitaphs vs honorific dedications vs votive offerings vs building inscriptions. Are the rise-and-fall patterns the same for all types? They are not.

[Figure 2: line chart. x-axis: 200 BCE to 800 CE; four series: epitaph (dominant), honorific, votive, building.]
FIG 2 Stylized: epitaphs dominate the corpus and drive the overall shape; honorific and votive inscriptions track loosely with the epitaph curve but with different peak shapes. Building inscriptions are a much smaller, flatter category. EDH and EDCS broadly agree on the relative ordering.
Key claim
"Latin inscriptions" is mostly "Roman epitaphs."

Across both corpora, funerary inscriptions outnumber every other category combined. Any aggregate analysis of "the epigraphic habit" is therefore largely an analysis of Roman commemorative funerary practice.

Fig 3 · Periodized comparison: EDH vs EDCSx

Same data, but bucketed into chronological phases (Republic / Early Imperial / High Imperial / Late Imperial / Late Antique) and shown side-by-side. This is where the editorial fingerprint of each database becomes visible.

FIG 3 Stylized period buckets, percentages relative to each corpus's total. EDH skews lighter into Late Imperial / Late Antique; EDCSx covers a slightly broader temporal range with messier dating granularity.
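
The period bucketing can be sketched with `pandas.cut` and `pandas.crosstab`. The period boundaries, the toy years, and the column names below are assumptions made for illustration; the paper's own cut-offs may differ.

```python
import pandas as pd

# Hypothetical period boundaries in years (negative = BCE).
edges = [-500, -27, 96, 235, 400, 800]
labels = ["Republic", "Early Imperial", "High Imperial", "Late Imperial", "Late Antique"]

# Toy single-year dates for two corpora (made-up values).
records = pd.DataFrame({
    "corpus": ["EDH"] * 4 + ["EDCSx"] * 4,
    "year": [-50, 120, 150, 300, 50, 180, 420, 500],
})

# Assign each record to a period; intervals are open on the left, closed on the right.
records["period"] = pd.cut(records["year"], bins=edges, labels=labels)

# Percentage of each period within its own corpus, so corpora of different size are comparable.
shares = pd.crosstab(records["corpus"], records["period"], normalize="index") * 100
print(shares.round(1))
```

Normalizing by row rather than by the grand total is what makes the two corpora comparable despite their very different sizes.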

Figs 4–6 · Where the inscriptions are

Three province-distribution charts, one per dataset version: EDH, EDCS (all), EDCSx (filtered). Italy and the city of Rome dominate every version, but the rank order of the next provinces shifts in revealing ways.

FIGS 4 · 5 · 6 Stylized top-10 provinces by inscription count. EDH highlights Italy, the western provinces, and the limes regions; EDCS (all) is more even toward the eastern Empire; EDCSx, once filtered to comparable dating quality, looks more like EDH.
Key claim
Geographic coverage is profoundly uneven, and the unevenness is partly modern, not ancient.

Italy and the city of Rome dominate every dataset, and the Rhine and Danube limes also rank high. This reflects both the genuine ancient distribution of inscribing communities and the modern history of where archaeologists looked, where 19th- and 20th-century epigraphic editions were prepared, and which regions ended up in databases first.

Fig 7 · Rome zoomed: types over time, in the imperial capital

A typologies-over-time chart, but restricted to Rome itself (using EDCSx). The point is to test whether the all-empire pattern holds in the densest single locality. It mostly does — but with sharper amplitude and faster late-antique decline.

[Stylized rendering of Fig7_Rome_Typologies_comparison_time_EDCSx.png — the same four-line plot as Fig 2, but for Rome only. Epitaph dominance is even more pronounced than in the empire-wide aggregate.]

§ 7 The methodological argument

If you take only one thing from the paper, take this: the substantive findings are not the most important contribution. The most important contribution is a stance on what doing digital ancient history responsibly looks like.

"We assert that research communities stand to gain from extending digital infrastructures to reduce barriers to access with packages of open and reusable research tools." (Heřmánková, Kaše & Sobotková 2021, abstract)
Stance 1 · Reproducibility
Every figure must be re-derivable from the same code and data.

The paper's figures are not screenshots — they are outputs of the open Jupyter notebooks in sdam-au/digital_epigraphy. Anyone can re-run them, change a parameter, and get a different version. This is the operational meaning of FAIR data.

Stance 2 · Honest uncertainty
Dating uncertainty must be propagated, not hidden.

Midpoint dating is convenient and silently distorts the curves. The paper introduces (and the team's tempun package later formalizes) probabilistic dating: every inscription contributes a probability distribution across its date range. The resulting curves are smoother and more honest about what we don't know.

Stance 3 · Dataset bias is scholarly content
Differences between EDH and EDCS are not noise to be averaged out.

When the two corpora disagree, that disagreement reveals something about each one's editorial history. Showing both side-by-side — rather than picking one and pretending it's "the corpus" — is itself a scholarly contribution.

§ 8 Strengths and limits

The paper is admirably honest about what its corpus can and cannot say.

What the paper establishes

  • The MacMullen rise-and-fall pattern survives at full corpus scale.
  • Funerary epitaphs dominate every type-distribution at every period.
  • EDH and EDCS, though different, agree on the gross shape of the curve.
  • Italy + Rome + the western limes dominate geographic distribution.
  • Dataset construction can be made transparent, reproducible, and citable.

What the paper does NOT establish

  • That the curve reflects ancient reality more than modern collection bias.
  • That the absolute numerical heights of the curve are reliable.
  • That Greek inscriptions follow the same pattern (this paper is Latin-only).
  • That eastern provinces are well-represented (they are not).
  • That an inscription's date is more than a probability over centuries.

§ 9 Implications

For digital classics. The paper sets a benchmark: future macro-historical claims about ancient inscriptions should come bundled with an open ETL pipeline, an explicit dating-uncertainty model, and a versioned dataset DOI. Anything less is harder to take seriously.

For ancient history more broadly. The paper extends well beyond epigraphy: it suggests that the unit of scholarly work in digital history should include the data engineering, not bracket it as "preliminary." The SDAM follow-on projects (Greek inscriptions, Greek texts, Bulgarian burial mounds) demonstrate this is a generalizable stance, not specific to Latin epigraphy.

For data-as-method debates. Treating sources "as data" is contested in the humanities. The paper's answer is not "yes you can" or "no you can't" — it is "you can if you make every transformation visible." The transformation chain is the scholarship.

§ 10 How to engage with the paper

  1. Read the article on JDH: doi.org/10.1515/jdh-2021-1004. The three-layer format makes the most sense in the JDH player itself.
  2. Browse the Visual edition (sdam-visual-slideshow.html) for an interactive, button-driven tour of the ETL machinery underlying every figure in the paper.
  3. Open the Reference edition (sdam-reference.html) for a complete technical companion, including the full SDAM repository map and code samples.
  4. Run the notebook — clone sdam-au/digital_epigraphy and re-execute every figure. Try changing the dating method to midpoint and see how Fig 1's amplitude shifts.
  5. Use the data — load EDH_text_cleaned_2022_11_03.json directly via pandas.read_json() from the public sciencedata.dk URL and start your own analysis.
  6. Cite — both the article DOI and the dataset DOI 10.5281/zenodo.7303886.

Two-line code, no clone needed

import pandas as pd
EDH = pd.read_json("https://sciencedata.dk/public/b6b6afdb969d378b70929e86e58ad975/EDH_text_cleaned_2022_11_03.json")
# 81,883 inscriptions ready

§ 11 Bibliography & further reading

The paper itself

  1. Heřmánková, P., Kaše, V., & Sobotková, A. (2021). Inscriptions as data: digital epigraphy in macro-historical perspective. Journal of Digital History, 1(1). doi.org/10.1515/jdh-2021-1004

Companion code & data

  1. sdam-au/digital_epigraphy — companion Jupyter notebook
  2. Heřmánková, P., & Kaše, V. (2022). EDH_text_cleaned_2022_11_03 (v2.0) [Data set]. Zenodo. 10.5281/zenodo.7303886
  3. Heřmánková, P. (2022). EDCS_text_cleaned_2022_09_12 (v2.0) [Data set]. Zenodo. 10.5281/zenodo.7072337
  4. Kaše, V., Sobotková, A., & Heřmánková, P. (2023). Modeling Temporal Uncertainty in Historical Datasets. CHR 2023. CEUR-WS

The intellectual lineage

  1. MacMullen, R. (1982). The epigraphic habit in the Roman Empire. American Journal of Philology, 103(3), 233–246. The founding article on the rise-and-fall pattern.
  2. Bodel, J. (Ed.). (2001). Epigraphic Evidence: Ancient History from Inscriptions. Routledge. The standard methodological introduction.

Source databases

  1. Epigraphic Database Heidelberg — edh.ub.uni-heidelberg.de
  2. Epigraphik-Datenbank Clauss/Slaby — manfredclauss.de
  3. Pleiades gazetteer — pleiades.stoa.org

Companion editions in this trilogy

  1. Visual Edition: an interactive slideshow on the SDAM ETL pipelines, button-driven, beginner-friendly.
  2. Reference Edition: a deep technical companion documenting all 37 SDAM repositories.
  3. Landing page: overview of the trilogy with EN/中文 toggle.