Inscriptions as data
Abstract (verbatim)
As short texts written on a durable medium, inscriptions represent invaluable insight into past societies, their organization, cultural norms and practices. Several hundred thousand inscriptions in Greek and Latin language survived until today, providing us with a line of evidence concerning populations of large cities and rural communities of the entire Mediterranean Basin in the period between the eighth century BC and eighth century AD. Although published inscriptions have been near-completely digitized and exist in online databases, and open computational tools exist to handle large datasets, large-scale and comparative studies of inscriptions are still rare. Numerous technical and conceptual issues, such as the inconsistent resolution of spatial and temporal attributes or the incompatibility of data structures between datasets, hinder the aggregation and analysis of thousands of inscriptions. The incomplete, uncertain and complex nature of inscriptions as a historical source required us to develop a series of custom open-source tools and reproducible pipelines, enabling a macro-scale overview of epigraphic production in time and space. To illustrate the potential of quantitative studies in epigraphy, we harvest and render comparable two well-established, yet very distinct, digital collections for Latin epigraphy: Epigraphic Database Heidelberg (EDH), containing over 81,000 records and Epigraphische Datenbank Clauss-Slaby (EDCS) with over 500,000 records. Placing the datasets side-by-side, we contrast past interpretations of epigraphic habit based on limited samples with trends derived from all available data and discuss their strengths and shortcomings of each respective dataset. We assert that research communities stand to gain from extending digital infrastructures to reduce barriers to access with packages of open and reusable research tools.
In this walkthrough
- The article in one paragraph
- The JDH three-layer format
- The research questions
- Two databases, side by side
- From databases to data
- The seven figures, walked through
- The methodological argument
- Strengths and limits
- Implications
- How to engage further
§ 1 The article in one paragraph
The paper takes the two largest digital corpora of Latin inscriptions — EDH (≈81,000 records, peer-reviewed and richly annotated) and EDCS (≈540,000 records, larger but more uneven) — harmonizes them through a transparent, reproducible ETL pipeline, and uses the result to revisit Ramsay MacMullen's classic 1982 thesis of an "epigraphic habit" that rose with the early Empire and collapsed in late antiquity. The empirical findings are largely conservative — the rise-and-fall pattern is real — but the methodological argument is bolder: digital epigraphy can only do macro-history responsibly if dataset construction is itself a scholarly, citable, reproducible artifact rather than an undocumented black box.
§ 1.5 The scholarly lineage this paper inherits and improves
Before reading the article on its own terms, it helps to see where it stands. Heřmánková, Kaše & Sobotková do not arrive at digital epigraphy fresh: they enter a discipline with a 200-year print tradition, a 40-year computational lineage, and a still-emerging consensus on what reproducible scholarship should look like. This section names the three streams the paper draws on and the specific gap it tries to close.
§ 1.5.1 The print-authority chain (1828–today)
Latin and Greek epigraphy as a modern discipline begins with the great Berlin corpus projects. The Corpus Inscriptionum Graecarum (August Boeckh, 1828–) and the Corpus Inscriptionum Latinarum (Theodor Mommsen, 1853–) committed European classical scholarship to the principle that every ancient inscription should be re-edited, dated, contextualised, and assigned a stable citation. CIL is still the master reference for Latin epigraphy some 170 years after the project was launched.
Three generations of scholars built this tradition
- The founders (mid-19th century): Boeckh, Mommsen, Kaibel, Dessau. Built the master corpora; established autopsy + apparatus criticus + bibliography as the editorial standard.
- The mid-century synthesizers (1920s–1970s): Louis Robert (Paris), Ronald Syme, A. H. M. Jones, Joyce M. Reynolds (Cambridge). Wrote the analytical literature that made inscriptions usable for ancient social history; ran the annual round-ups (BE, AE, SEG).
- The macro-historians (1980s–1990s): Ramsay MacMullen (Yale), Werner Eck, Géza Alföldy, Greg Woolf. Asked corpus-scale questions ("did inscribing rise then fall?") that demanded counting. The 1982 MacMullen article — which Heřmánková et al. revisit — is the seminal corpus-statistics paper.
The print tradition's quiet assumption: that the printed corpus is itself the data. To do statistics on the field, one counted CIL volumes, AE entries, SEG numbers — and accepted that one's totals were partial, but at least made of citable, autopsy-grounded readings.
§ 1.5.2 The digital turn (1980s–today)
Digital epigraphy is roughly 40 years old. It moved through four overlapping waves, each of which solved one problem and revealed the next:
The point: Heřmánková, Kaše & Sobotková arrive at the cusp of Wave 3 ↔ Wave 4. They use the encoding work of Wave 3 (TEI EpiDoc, structured EDH metadata) and explicitly bring the methodological commitments of Wave 4 (Monte Carlo, ETL transparency, parquet open distribution) into Latin epigraphy.
§ 1.5.3 Where Heřmánková, Kaše & Sobotková improve the scholarship
The paper's specific contribution is sharper than "we redid MacMullen with bigger data." It is a methodological synthesis: the authors make five concrete moves that had not previously been combined.
| Move | What it does | Inheriting from | Improving on |
|---|---|---|---|
| 1. Open ETL | Every step of cleaning is a code commit | Wave 4 (FAIR, software-engineering standards) | EDCS-only workflows where data origin is opaque |
| 2. Tempun Monte Carlo | Date uncertainty is sampled, not collapsed | Bayesian dating in archaeology (Bronk Ramsey, Buck) | Earlier counting that pinned each inscription to a single midpoint year |
| 3. Two-database honest comparison | EDH and EDCSx counted side-by-side; the gap is the finding | Open-source benchmark culture (Kaggle, MLPerf) | Studies that pick one database and report a single number |
| 4. Pleiades georeferencing | Every record has a stable Pleiades URI for its place | Ancient World Mapping Center / Pelagios | Free-text place names that don't join across datasets |
| 5. Methodological reflection in the journal medium | JDH 3-layer format makes method visible | Replication / pre-registration movement | Conventional articles where method is a paragraph |
In one sentence
The improvement is not that the answers are different from MacMullen's — they mostly aren't — but that the path from question to answer is now itself a checkable scholarly object. Earlier macro-history asked you to trust the historian's count; this paper asks you to re-run their pipeline.
Adjacent contemporaries who matter: Roger Bagnall (papyrology, scale, demographic estimation), John Bodel (US Epigraphy Project, methodological reflection), Anne Mahoney and Gabriel Bodard (Stoa, EpiDoc), Sebastian Heath, Hugh Cayless. The JDH 2021 paper sits in this conversation, not above it.
§ 2 The JDH three-layer format
Before the article's content, a word about its form. This paper appeared in the inaugural issue of the Journal of Digital History, which uses a deliberate three-layer publication format.
The three-layer format is not decoration — it is part of the argument the article makes. If the paper's claim is that data construction must itself be transparent and reusable, then the journal article must be a kind of object you can inspect, re-run, and modify. The medium enacts the message.
§ 3 The research questions
The paper interleaves two questions — one empirical, one methodological — and the second is more important.
What does the long-term temporal, geographic, and typological distribution of Latin inscriptions look like at full corpus scale?
Concretely: does MacMullen's "epigraphic habit" curve — rise in the early Empire, peak in the late 2nd / early 3rd c. CE, sharp decline through Late Antiquity — survive when computed not from a hand-picked sample but from every inscription EDH and EDCS hold?
What does it mean — epistemologically — to treat humanistic source material as data?
Inscriptions are not raw observations of antiquity. They were carved, found, transcribed, edited, restored, dated, indexed, digitized, and finally exposed via API or HTML — every link in that chain is an interpretive act. What does it cost to forget that?
§ 4 Two databases, side by side
The paper does not treat EDH and EDCS as interchangeable suppliers of "Latin inscriptions." It treats them as different scholarly artifacts with different editorial commitments, different scope, and therefore different blind spots.
EDH · Heidelberg
Curated by a team at Heidelberg University. Each inscription is editorially reviewed, dated against published scholarship, georeferenced via the Pleiades gazetteer, and exposed through a public JSON API and downloadable EpiDoc XML. (EDH belongs to Family 1 — aggregators; for the broader six-family map see the Atlas.)
- Strong on western provinces (Italy, Gaul, Germania, Pannonia)
- High data quality, low quantity
- Easy to script against
EDCS · Clauss/Slaby
Compiled and maintained by Manfred Clauss and Anne Kolb. Aspires to ingest everything published in the major Latin-epigraphy print corpora (CIL, AE, etc.). Larger but with less per-record curation, and exposed only as a public-search HTML interface — no API.
- Maximum coverage including the eastern Empire
- Inconsistent dating granularity across centuries-long ranges
- Must be scraped (Lat Epig 2.0 takes 4–5 hours)
A third entity appears in the paper's figures: EDCSx. This is the authors' filtered subset of EDCS, restricted to inscriptions that are dated and located precisely enough to compare meaningfully with EDH. The paper's most carefully argued figures use EDCSx, not raw EDCS — because applying EDH-quality filters to EDCS makes the two corpora actually comparable.
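The filtering idea behind EDCSx can be sketched as follows. This is a minimal illustration in plain Python, not the authors' actual criteria: the field names (`not_before`, `not_after`, `lat`, `lon`), the toy records, and the 200-year `max_span` threshold are all assumptions invented for the example.

```python
# Hypothetical records; field names and values are assumptions for this sketch.
records = [
    {"id": 1, "not_before": 101, "not_after": 150, "lat": 41.9, "lon": 12.5},
    {"id": 2, "not_before": None, "not_after": None, "lat": 45.4, "lon": 9.2},
    {"id": 3, "not_before": 1, "not_after": 400, "lat": 48.2, "lon": 16.4},
    {"id": 4, "not_before": 201, "not_after": 250, "lat": None, "lon": None},
]


def is_comparable(rec, max_span=200):
    """Keep a record only if it is dated, located, and not too loosely dated.

    max_span is an assumed cut-off (in years) for "precise enough" dating.
    """
    dated = rec["not_before"] is not None and rec["not_after"] is not None
    located = rec["lat"] is not None and rec["lon"] is not None
    if not (dated and located):
        return False
    return rec["not_after"] - rec["not_before"] <= max_span


# The filtered subset plays the role EDCSx plays relative to EDCS.
subset = [r for r in records if is_comparable(r)]
kept_ids = [r["id"] for r in subset]
```

Only record 1 survives here: record 2 is undated, record 3 is dated too loosely, and record 4 has no coordinates. Changing `max_span` is a quick way to see how sensitive the subset size is to the chosen notion of precision.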
§ 5 From databases to data
Roughly half of the article is devoted to how EDH and EDCS were turned into a single comparable analysis-ready table. This is unusual for a humanities paper and is the article's central methodological move.
The four moves
- Extract. Walk EDH's API; scrape EDCS province by province with Lat Epig 2.0; harvest the EpiDoc XML dumps to recover dating prose the API has flattened.
- Transform. Standardize dates as integer ranges; resolve places against Pleiades; clean inscription text in two variants (interpretive + conservative); harmonize incompatible inscription-type taxonomies.
- Load. Publish two artifacts on Zenodo with separate DOIs — one for the dataset, one for the scripts — and mirror them to a public sciencedata.dk folder for unauthenticated read access.
- Date probabilistically. Treat each inscription's date range as a probability distribution. Draw thousands of Monte Carlo samples per record. Aggregate. This is what later becomes the tempun package.
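The last of these moves, probabilistic dating, can be sketched in a few lines. This is a rough standard-library illustration and not the tempun API: the record fields (`not_before`, `not_after`), the uniform distribution within each range, the 50-year bins, and the function names are all assumptions made for the example, and the toy records use CE years only.

```python
import random
from collections import Counter

# Hypothetical records with integer date ranges in years CE.
records = [
    {"id": "A", "not_before": 1, "not_after": 100},
    {"id": "B", "not_before": 151, "not_after": 200},
    {"id": "C", "not_before": 171, "not_after": 230},
]


def simulate_dates(records, n_sims=1000, seed=42):
    """Draw n_sims random years per record, uniformly within its date range."""
    rng = random.Random(seed)
    return {
        r["id"]: [rng.randint(r["not_before"], r["not_after"]) for _ in range(n_sims)]
        for r in records
    }


def counts_per_bin(simulated, bin_width=50):
    """Average number of inscriptions per time bin across all simulations."""
    n_sims = len(next(iter(simulated.values())))
    totals = Counter()
    for years in simulated.values():
        for year in years:
            # Label each bin by its first year, e.g. years 151-200 -> 151.
            bin_start = ((year - 1) // bin_width) * bin_width + 1
            totals[bin_start] += 1
    return {b: totals[b] / n_sims for b in sorted(totals)}


simulated = simulate_dates(records)
curve = counts_per_bin(simulated)
```

Each record contributes a total weight of one, spread over the bins its range touches, so the values in `curve` sum to the number of records. Record B always falls in the bin starting at 151, while record C is split between that bin and the next.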
§ 6 Findings: the seven figures, walked through
The article's substantive analysis is carried by seven figures. Each is reproduced from the companion notebook and visualized below in stylized form.
Fig 1: The shape of the epigraphic habit
A single curve showing how many Latin inscriptions are dated to each year, summed across the entire corpus and weighted by probabilistic dating. The shape is the central empirical object of the paper.
MacMullen's "epigraphic habit" pattern is real, even at full corpus scale.
The rise-peak-fall shape is not an artifact of the small samples MacMullen worked with in 1982 — it persists across all 600k+ Latin inscriptions in EDH and EDCS combined.
Fig 2: Inscription types over time
Same temporal axis, but now broken down by kind of inscription: epitaphs vs honorific dedications vs votive offerings vs building inscriptions. Are the rise-and-fall patterns the same for all types? They are not.
"Latin inscriptions" is mostly "Roman epitaphs."
Across both corpora, funerary inscriptions outnumber every other category combined. Any aggregate analysis of "the epigraphic habit" is therefore largely an analysis of Roman commemorative funerary practice.
Fig 3: Periodized comparison, EDH vs EDCSx
Same data, but bucketed into chronological phases (Republic / Early Imperial / High Imperial / Late Imperial / Late Antique) and shown side-by-side. This is where the editorial fingerprint of each database becomes visible.
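One way to bucket date ranges into phases without collapsing them to a single year is to give each period the fraction of the range that overlaps it. The sketch below illustrates that idea only: the period boundaries are placeholder values chosen for the example (not the paper's definitions), only three of the phases are included, and the function names and toy date ranges are hypothetical.

```python
from collections import defaultdict

# Assumed period boundaries (inclusive years CE); illustrative placeholders only.
PERIODS = [
    ("Early Imperial", 1, 100),
    ("High Imperial", 101, 200),
    ("Late Imperial", 201, 300),
]


def period_shares(not_before, not_after):
    """Split one inscription across periods in proportion to range overlap."""
    span = not_after - not_before + 1
    shares = {}
    for name, start, end in PERIODS:
        overlap = min(not_after, end) - max(not_before, start) + 1
        if overlap > 0:
            shares[name] = overlap / span
    # Shares sum to less than 1 if the range extends beyond the listed periods.
    return shares


def periodize(date_ranges):
    """Sum the fractional contributions of many inscriptions per period."""
    totals = defaultdict(float)
    for not_before, not_after in date_ranges:
        for name, share in period_shares(not_before, not_after).items():
            totals[name] += share
    return dict(totals)


toy_corpus = [(151, 250), (1, 100), (101, 200)]
by_period = periodize(toy_corpus)
```

Running `periodize` separately on two corpora and placing the results next to each other gives the kind of side-by-side comparison the figure describes.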
Figs 4–6: Where the inscriptions are
Three province-distribution charts, one per dataset version: EDH, EDCS (all), EDCSx (filtered). Italy and the city of Rome dominate every version, but the rank order of the next provinces shifts in revealing ways.
Geographic coverage is profoundly uneven — and the unevenness is partly modern, not ancient.
Italy and the city of Rome dominate every dataset, and the Rhine and Danube limes are also heavily represented. This reflects both the genuine ancient distribution of inscribing communities and the modern history of where archaeologists looked, where 19th- and 20th-century epigraphic editions were prepared, and which regions ended up in databases first.
Fig 7: Rome zoomed, types over time in the imperial capital
A typologies-over-time chart, but restricted to Rome itself (using EDCSx). The point is to test whether the all-empire pattern holds in the densest single locality. It mostly does — but with sharper amplitude and faster late-antique decline.
[Stylized rendering of Fig7_Rome_Typologies_comparison_time_EDCSx.png — the same four-line plot as Fig 2, but for Rome only. Epitaph dominance is even more pronounced than in the empire-wide aggregate.]
§ 7 The methodological argument
If you take only one thing from the paper, take this: the substantive findings are not the most important contribution. The most important contribution is a stance on what doing digital ancient history responsibly looks like.
"We assert that research communities stand to gain from extending digital infrastructures to reduce barriers to access with packages of open and reusable research tools." (Heřmánková, Kaše & Sobotková 2021, abstract)
Every figure must be re-derivable from the same code and data.
The paper's figures are not screenshots — they are outputs of the open Jupyter notebooks in sdam-au/digital_epigraphy. Anyone can re-run them, change a parameter, and get a different version. This is the operational meaning of FAIR data.
Dating uncertainty must be propagated, not hidden.
Midpoint dating is convenient and silently distorts the curves. The paper introduces (and the team's tempun package later formalizes) probabilistic dating: every inscription contributes a probability distribution across its date range. The resulting curves are smoother and more honest about what we don't know.
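A small worked comparison shows how midpoint dating can distort a curve. The sketch below is an illustration under stated assumptions, not the paper's code: the three date ranges are invented, centuries are counted from year 1 CE, and the "spread" variant assumes every year in a range is equally likely.

```python
from collections import Counter

# Hypothetical date ranges (inclusive years CE).
ranges = [(101, 300), (151, 200), (1, 200)]


def century(year):
    """Map a year CE to its century number (1-100 -> 1, 101-200 -> 2, ...)."""
    return (year - 1) // 100 + 1


def midpoint_counts(ranges):
    """Assign each inscription wholly to the century of its range midpoint."""
    counts = Counter()
    for start, end in ranges:
        counts[century((start + end) // 2)] += 1
    return dict(counts)


def spread_counts(ranges):
    """Spread each inscription evenly over every year of its range."""
    counts = Counter()
    for start, end in ranges:
        weight = 1 / (end - start + 1)
        for year in range(start, end + 1):
            counts[century(year)] += weight
    return {c: round(v, 6) for c, v in sorted(counts.items())}


by_midpoint = midpoint_counts(ranges)
by_spread = spread_counts(ranges)
```

With these toy ranges the midpoint method reports nothing at all for the third century, while the spread method assigns it half an inscription: the range 101–300 is equally compatible with both centuries, and only the second method says so.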
Differences between EDH and EDCS are not noise to be averaged out.
When the two corpora disagree, that disagreement reveals something about each one's editorial history. Showing both side-by-side — rather than picking one and pretending it's "the corpus" — is itself a scholarly contribution.
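Because the two corpora differ so much in size, a side-by-side comparison is easier to read on proportions than on raw counts. The following sketch shows one way to do that. The counts are toy numbers invented for the example (they are not results from EDH or EDCS), and the function names and period labels are hypothetical.

```python
def to_shares(counts):
    """Convert raw counts per category into proportions of the corpus total."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def share_gap(counts_a, counts_b):
    """Per-category difference in share between corpus A and corpus B."""
    a, b = to_shares(counts_a), to_shares(counts_b)
    keys = sorted(set(a) | set(b))
    return {k: round(a.get(k, 0.0) - b.get(k, 0.0), 3) for k in keys}


# Toy counts per period for two corpora of very different size.
toy_a = {"period_1": 20, "period_2": 60, "period_3": 20}
toy_b = {"period_1": 100, "period_2": 200, "period_3": 100, "period_4": 100}

gap = share_gap(toy_a, toy_b)
```

A positive value means corpus A devotes a larger share of its records to that period than corpus B does. Categories missing from one corpus, like `period_4` here, show up as a gap instead of silently disappearing.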
§ 8 Strengths and limits
The paper is admirably honest about what its corpus can and cannot say.
What the paper establishes
- The MacMullen rise-and-fall pattern survives at full corpus scale.
- Funerary epitaphs dominate every type-distribution at every period.
- EDH and EDCS, though different, agree on the gross shape of the curve.
- Italy + Rome + the western limes dominate geographic distribution.
- Dataset construction can be made transparent, reproducible, and citable.
What the paper does NOT establish
- That the curve reflects ancient reality more than modern collection bias.
- That the absolute numerical heights of the curve are reliable.
- That Greek inscriptions follow the same pattern (this paper is Latin-only).
- That eastern provinces are well-represented (they are not).
- That an inscription's date is more than a probability over centuries.
§ 9 Implications
For digital classics. The paper sets a benchmark: future macro-historical claims about ancient inscriptions should come bundled with an open ETL pipeline, an explicit dating-uncertainty model, and a versioned dataset DOI. Anything less is harder to take seriously.
For ancient history more broadly. The paper extends well beyond epigraphy: it suggests that the unit of scholarly work in digital history should include the data engineering, not bracket it as "preliminary." The SDAM follow-on projects (Greek inscriptions, Greek texts, Bulgarian burial mounds) demonstrate this is a generalizable stance, not specific to Latin epigraphy.
For data-as-method debates. Treating sources "as data" is contested in the humanities. The paper's answer is not "yes you can" or "no you can't" — it is "you can if you make every transformation visible." The transformation chain is the scholarship.
§ 10 How to engage with the paper
- Read the article on JDH: doi.org/10.1515/jdh-2021-1004. The three-layer format makes the most sense in the JDH player itself.
- Browse the Visual edition (sdam-visual-slideshow.html) for an interactive, button-driven tour of the ETL machinery underlying every figure in the paper.
- Open the Reference edition (sdam-reference.html) for a complete technical companion, including the full SDAM repository map and code samples.
- Run the notebook: clone sdam-au/digital_epigraphy and re-execute every figure. Try changing the dating method to midpoint and see how Fig 1's amplitude shifts.
- Use the data: load EDH_text_cleaned_2022_11_03.json directly via pandas.read_json() from the public sciencedata.dk URL and start your own analysis.
- Cite: both the article DOI and the dataset DOI 10.5281/zenodo.7303886.
Two-line code, no clone needed
    import pandas as pd

    # Load the cleaned EDH dataset straight from the public sciencedata.dk folder.
    EDH = pd.read_json("https://sciencedata.dk/public/b6b6afdb969d378b70929e86e58ad975/EDH_text_cleaned_2022_11_03.json")  # 81,883 inscriptions ready
§ 11 Bibliography & further reading
The paper itself
- Heřmánková, P., Kaše, V., & Sobotková, A. (2021). Inscriptions as data: digital epigraphy in macro-historical perspective. Journal of Digital History, 1(1). doi.org/10.1515/jdh-2021-1004
Companion code & data
- sdam-au/digital_epigraphy — companion Jupyter notebook
- Heřmánková, P., & Kaše, V. (2022). EDH_text_cleaned_2022_11_03 (v2.0) [Data set]. Zenodo. 10.5281/zenodo.7303886
- Heřmánková, P. (2022). EDCS_text_cleaned_2022_09_12 (v2.0) [Data set]. Zenodo. 10.5281/zenodo.7072337
- Kaše, V., Sobotková, A., & Heřmánková, P. (2023). Modeling Temporal Uncertainty in Historical Datasets. CHR 2023. CEUR-WS
The intellectual lineage
- MacMullen, R. (1982). The epigraphic habit in the Roman Empire. American Journal of Philology, 103(3), 233–246. The founding article on the rise-and-fall pattern.
- Bodel, J. (Ed.). (2001). Epigraphic Evidence: Ancient History from Inscriptions. Routledge. The standard methodological introduction.
Source databases
- Epigraphic Database Heidelberg — edh.ub.uni-heidelberg.de
- Epigraphik-Datenbank Clauss/Slaby — manfredclauss.de
- Pleiades gazetteer — pleiades.stoa.org
Companion editions in this trilogy
- Visual Edition: an interactive slideshow on the SDAM ETL pipelines, button-driven, beginner-friendly.
- Reference Edition: a deep technical companion documenting all 37 SDAM repositories.
- Landing page: overview of the trilogy with EN/中文 toggle.