Tombstones, altars, milestones, dedications: over 600,000 inscriptions survive from the Mediterranean world between 800 BCE and 800 CE. To study them at scale, scholars first need them as data.
SDAM (Social Dynamics in the Ancient Mediterranean, Aarhus University) builds the open pipelines, packages, and analyses that turn these inscriptions and texts into clean, citable, reproducible datasets.
Click each letter for a plain-language definition and the actual SDAM notebooks that do the work.
Each letter is a stage in the journey from messy source material to a clean dataset.
Two pipelines extract from different sources; a third combines them. Click any card for details.
Pulls inscriptions from the EDH via its public API and EpiDoc XML.
Scrapes the EDCS website province by province (EDCS has no API).
Merges both, removes duplicates, and harmonizes inscription types via machine learning.
There are also pipelines for Greek inscriptions, Greek texts, and Bulgarian burial mounds, covered further on.
Click each block in turn, from left (the source) to right (researchers). Each correct placement reveals what that step actually does.
"Extract" means visiting wherever the data lives and bringing back a copy, without changing anything yet.
Imagine going to a library and photocopying every page in a section. The cleanup happens later. SDAM stores the raw copy as a separate, named file (e.g. JSON) so the next stage can re-read it without re-extracting.
Make a local copy of everything the source has, in whatever format it provides.
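The "store the raw copy once, re-read it later" idea can be sketched in a few lines. This is a minimal illustration, not SDAM's code: the function name `extract_once` and the file layout are assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable


def extract_once(raw_path: Path, extractor: Callable[[], list]) -> list:
    """Return the raw copy, calling the (slow) extractor only if no file exists yet."""
    if raw_path.exists():
        # A raw copy is already on disk: re-read it instead of re-extracting.
        return json.loads(raw_path.read_text(encoding="utf-8"))
    records = extractor()
    raw_path.parent.mkdir(parents=True, exist_ok=True)
    # Store the records exactly as received; cleaning belongs to the next stage.
    raw_path.write_text(json.dumps(records, ensure_ascii=False), encoding="utf-8")
    return records
```

The second call with the same path never touches the source again, which is what lets the transform stage be re-run cheaply.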
Heidelberg offers an API: you send a query to their database and it sends back the answer. Press the button.
You triggered an API request against the EDH endpoint /data/api/inscriptions/search. In real life the extraction pages through the results 200 records at a time, which takes about 410 calls (on the order of 82,000 records) and about 12 minutes. The output is one JSON file with every inscription in EDH.
The actual code lives in the EDH_ETL repo, in the notebook 1_1_py_EXTRACTION_edh-inscriptions-from-web-api.ipynb.
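The paging loop described above can be sketched as follows. This is not the notebook's code: `fetch_all` and `fake_fetch_page` are illustrative names, and the HTTP call is replaced by an in-memory stand-in so the example runs offline. A real `fetch_page` would query the search endpoint with offset/limit-style parameters, whose exact names are an assumption here.

```python
from typing import Callable, Iterator

PAGE_SIZE = 200  # page size quoted in the text


def fetch_all(fetch_page: Callable[[int, int], list], page_size: int = PAGE_SIZE) -> Iterator[dict]:
    """Yield every record by requesting consecutive pages until one comes back short."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:  # a short (or empty) page means the end was reached
            return
        offset += page_size


# Stand-in for the real HTTP call: serves slices of an in-memory list.
fake_db = [{"id": i} for i in range(450)]


def fake_fetch_page(offset: int, limit: int) -> list:
    return fake_db[offset:offset + limit]


records = list(fetch_all(fake_fetch_page))
print(len(records))  # 450 records, fetched in 3 calls (200 + 200 + 50)
```

At 200 records per call, 410 calls cover at most 82,000 records, which is how the call count in the text relates to the size of the database.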
EDCS has no vending machine: there is no API to query. A scraper visits every Roman province's page, one at a time, which takes about 4–5 hours.
You ran a simulated web scrape across 18 Roman provinces. The real scraper is Lat Epig 2.0, a Docker tool from Macquarie University that wraps EDCS's public search interface and saves one TSV per province. A full scrape takes 4–5 hours and produces ~537,000 cleaned Latin inscriptions.
SDAM does not write the scraper; it consumes the scraper's output. The instructions for running it (clone the repo, switch branch, run bash dockerScraperAll.sh) live in the EDCS_ETL README.
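Consuming that output means reading one TSV per province and stacking the rows. A minimal sketch, assuming the files sit in one folder and share a header row; the function name and the `source_file` column are invented for the example.

```python
import csv
from pathlib import Path


def combine_province_tsvs(folder: Path) -> list[dict]:
    """Read every *.tsv in folder and return all rows, tagged with their source file."""
    rows = []
    for tsv_path in sorted(folder.glob("*.tsv")):
        with tsv_path.open(encoding="utf-8", newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                row["source_file"] = tsv_path.name  # keep provenance for later checks
                rows.append(row)
    return rows
```

Tagging each row with its file of origin makes it possible to trace a suspicious record back to the province scrape that produced it.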
Raw data is messy. One inscription says "AD 100", another "100 CE", a third "around the 2nd century". Are they the same? The transform stage decides.
This is the most intellectually demanding stage. Every cleaning rule is a scholarly choice, and SDAM keeps the original alongside the cleaned version so you can always see what changed.
Turn many inconsistent records into one consistent table that a computer (and a historian) can analyze.
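To make the three date wordings comparable, each can be turned into an explicit year range while the original string is kept. The sketch below is illustrative only: the two rules, the `DateRange` type and the reading of "nth century" as years (n-1)·100+1 to n·100 are assumptions for the example, not SDAM's rule set, and BCE dates are deliberately left unparsed.

```python
import re
from typing import NamedTuple, Optional


class DateRange(NamedTuple):
    raw: str          # the original wording, kept alongside the cleaned values
    not_before: int   # earliest possible year (CE)
    not_after: int    # latest possible year (CE)


YEAR = re.compile(r"^(?:AD\s+)?(\d{1,4})(?:\s+(?:CE|AD))?$", re.IGNORECASE)
CENTURY = re.compile(r"(\d{1,2})(?:st|nd|rd|th)\s+century", re.IGNORECASE)


def parse_date(raw: str) -> Optional[DateRange]:
    text = raw.strip()
    if (m := YEAR.match(text)):
        year = int(m.group(1))
        return DateRange(raw, year, year)
    if (m := CENTURY.search(text)):
        n = int(m.group(1))
        return DateRange(raw, (n - 1) * 100 + 1, n * 100)
    return None  # unknown wording: leave it for a human to decide


for wording in ["AD 100", "100 CE", "around the 2nd century"]:
    print(parse_date(wording))
```

Under these rules the first two wordings become the same single year, while the third becomes the range 101 to 200, which does not contain the year 100. Whether "around" should widen that range is exactly the kind of scholarly choice the text describes.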
Many inscriptions appear in both EDH and EDCS. The LI pipeline finds duplicates and merges them, using a machine-learning classifier to harmonize the two databases' incompatible inscription-type taxonomies.
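The idea of learning a label mapping from inscriptions that appear in both databases can be shown with something much simpler than a classifier. The sketch below swaps in a plain majority vote: for each EDCS label it picks the EDH label most often attached to the same inscriptions. It is not the pipeline's ML model, and the toy data is invented.

```python
from collections import Counter, defaultdict


def learn_label_map(pairs):
    """pairs: (edcs_label, edh_label) for inscriptions present in both databases.

    Returns {edcs_label: (most common edh_label, share of pairs agreeing)}.
    """
    votes = defaultdict(Counter)
    for edcs_label, edh_label in pairs:
        votes[edcs_label][edh_label] += 1
    mapping = {}
    for edcs_label, counter in votes.items():
        best, count = counter.most_common(1)[0]
        mapping[edcs_label] = (best, count / sum(counter.values()))
    return mapping


# Toy overlap data (invented for illustration, not real corpus counts).
overlap = [("sepulcralis", "epitaph")] * 9 + [("sepulcralis", "honorific inscription")]
print(learn_label_map(overlap))  # {'sepulcralis': ('epitaph', 0.9)}
```

A real classifier can also use the inscription text and other attributes, which a lookup table like this cannot.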
Two records, one from EDH and one from EDCS, were matched by CIL/AE reference, geographic proximity, date overlap, and text similarity. Their attributes were merged column by column. The inscription type was unified by an ML classifier trained on inscriptions present in both databases; the EDH-style label "epitaph" replaced EDCS's Latin "sepulcralis", with confidence p=0.94.
The result is one row in the merged corpus called LIST (Latin Inscriptions in Space and Time). A spatio-temporally restricted subset called LIRE (Latin Inscriptions of the Roman Empire) is published separately.
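The four matching criteria can be sketched as four small tests. How the real pipeline weighs or combines them is not stated here, so this example simply requires all four. The thresholds (`max_km`, `min_similarity`), the `Record` type and the sample values are assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Record:
    refs: set            # corpus references (invented values below)
    lat: float
    lon: float
    not_before: int
    not_after: int
    text: str


def km_between(a: Record, b: Record) -> float:
    """Great-circle distance (haversine formula) in kilometres."""
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dlat, dlon = p2 - p1, math.radians(b.lon - a.lon)
    h = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def is_same_inscription(a: Record, b: Record, max_km=5.0, min_similarity=0.8) -> bool:
    shared_ref = bool(a.refs & b.refs)                                   # CIL/AE reference
    close = km_between(a, b) <= max_km                                   # geographic proximity
    dates_overlap = a.not_before <= b.not_after and b.not_before <= a.not_after
    similar = SequenceMatcher(None, a.text.lower(), b.text.lower()).ratio() >= min_similarity
    return shared_ref and close and dates_overlap and similar


edh = Record({"CIL 08, 00001"}, 36.85, 10.33, 101, 200, "Dis Manibus sacrum Iulia vixit annis XX")
edcs = Record({"CIL 08, 00001", "AE 1900, 0001"}, 36.86, 10.32, 150, 250,
              "Dis Manibus sacrum Iulia vixit annis XX")
print(is_same_inscription(edh, edcs))  # True
```

Requiring every criterion is the strictest possible rule; a scoring scheme that tolerates one missing signal would find more matches at the cost of more false ones.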
Every clean dataset goes to two places. Click each for details.
The team's shared cloud drive. Anyone can read the public folder.
Each release gets a DOI, a permanent identifier that resolves to the release's web address.
A 2nd-century stele from Africa Proconsularis. Press play.
Most inscriptions aren't dated to a year, only to a range. Naive midpoint dating creates fake spikes, because every inscription dated only to the 2nd century lands on the year 150. tempun uses Monte Carlo simulation to spread probability honestly across each range, propagating uncertainty as FAIR data principles require.
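The contrast between midpoint dating and simulation can be reproduced with the standard library. This sketch does not use tempun or reproduce its API; it assumes a uniform distribution inside each range, and the function names, bin size and sample data are invented.

```python
import random
from collections import Counter


def simulate_counts(ranges, n_sims=1000, bin_size=50, seed=0):
    """Average number of inscriptions per time bin over n_sims random datings."""
    rng = random.Random(seed)
    totals = Counter()
    for _ in range(n_sims):
        for start, end in ranges:
            year = rng.randint(start, end)  # one plausible year, uniform in the range
            totals[(year // bin_size) * bin_size] += 1
    return {b: totals[b] / n_sims for b in sorted(totals)}


def midpoint_counts(ranges, bin_size=50):
    """The naive alternative: every inscription lands on the midpoint of its range."""
    counts = Counter(((s + e) // 2 // bin_size) * bin_size for s, e in ranges)
    return dict(sorted(counts.items()))


# 100 invented inscriptions, all dated only to the 2nd century (101-200 CE).
ranges = [(101, 200)] * 100
print(midpoint_counts(ranges))  # {150: 100}: the whole century piles into one bin
print(simulate_counts(ranges))  # counts spread across the bins starting at 100, 150 and 200
```

With midpoints, the bin starting at 150 shows a spike of 100 inscriptions and the first half of the century looks empty. The simulated averages split the same inscriptions almost evenly between the two halves of the century.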
Click any card to see what it covers, where to download it, and what makes it different.
Heidelberg, accessed via API + EpiDoc XML.
Clauss/Slaby, scraped via Lat Epig.
EDH + EDCS deduplicated, types harmonized.
PHI Greek inscriptions, enriched via Trismegistos.
1,958 Greek works, 35M tokens, lemmatized.
Same ETL discipline, applied to archaeology.
Every research project is its own open repo. Click any card to open its GitHub.
digital_epigraphy (JDH 2021): macro-historical analysis of the rise and fall of inscription production.
epigraphic_roads: reconstructing ancient road networks from inscription distribution.
formulae: quantitative analysis of recurring phrasing (D.M., vixit annis).
NLP_inscriptions: NLP experiments on inscription text (Connected Past 2021).
social_diversity: division of labor and occupational specialization across Roman cities.
coins: numismatic exploration applying SDAM data discipline.
landscape_prominence: R functions calculating site visibility, the Roman field of view.
landscape-travel: Mediterranean-scale pedestrian travel times between settlements.
OCR · epigraphic_cleaning: character recognition and pre-cleaning of inscription texts.
Wrappers around the ugly parts (authentication, file paths, dating math) so notebooks stay short.
One-line access to sciencedata.dk: read and write JSON, CSV, and parquet files in a Danish national research-data folder as if it were local.
R toolkit with a built-in EDH dataset, place-name maps, and a prex() function for "probability of existence", which distributes any inscription's date range across time bins.
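The binning idea behind "probability of existence" can be sketched as the share of a date range that falls inside each time bin. This is a Python illustration of the concept, not the R package's prex() implementation; it assumes every year in the range is equally likely, and the function name and bin edges are invented.

```python
def existence_probabilities(not_before, not_after, bin_edges):
    """Share of an inclusive year range that falls in each [lo, hi) bin."""
    span = not_after - not_before + 1  # number of candidate years
    shares = {}
    for lo, hi in zip(bin_edges, bin_edges[1:]):
        overlap = min(not_after + 1, hi) - max(not_before, lo)
        shares[(lo, hi)] = max(overlap, 0) / span
    return shares


# An inscription dated 151-250 CE, binned by century.
print(existence_probabilities(151, 250, [1, 101, 201, 301]))
# {(1, 101): 0.0, (101, 201): 0.5, (201, 301): 0.5}
```

Summing these shares over a whole corpus gives an expected number of inscriptions per period without forcing any single inscription into one bin.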
Heřmánková, Kaše & Sobotková (2021) appeared in the inaugural issue of the Journal of Digital History, which uses a deliberate three-layer publication format. The paper revisits the "epigraphic habit" thesis at full corpus scale. Click each layer.
Click any tile to see how that figure shapes the paper's claim.
Each figure is reproduced and explained in the Paper Edition. The actual PNGs are in the digital_epigraphy repo, generated by the companion notebook.
A bilingual marble plaque from Roman Sicily (ISic000470, a stonecutter's shop sign in Greek and archaic Latin) appears across five major epigraphic databases. Each says something different about it.
Four database IDs (one repeated), one DOI, six print-corpus references, two epigraphic-bulletin clusters. The pulsing chips are the same physical inscription listed twice in PHI.
Without a multi-way crosswalk, automated deduplication over-counts this single inscription by up to 5×. Only I.Sicily's TEI publicationStmt records all five external IDs in one place.
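The effect of a crosswalk on counting can be shown with a small union-find structure: every pair of IDs known to describe the same stone merges two groups, and the number of groups left is the number of physical inscriptions. ISic000470 and PHI 140601 come from the text; the other IDs are placeholders, and the function name is invented.

```python
def count_distinct(record_ids, crosswalk):
    """Count physical inscriptions given pairs of IDs known to describe the same stone."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in crosswalk:
        parent[find(a)] = find(b)  # merge the two groups
    return len({find(rid) for rid in record_ids})


records = ["ISic000470", "EDCS-placeholder", "EDR-placeholder", "PHI 140601", "PHI-placeholder"]
crosswalk = [("ISic000470", other) for other in records[1:]]

print(count_distinct(records, []))         # 5: with no crosswalk every record counts separately
print(count_distinct(records, crosswalk))  # 1: one stone
```

The crosswalk here hangs every external ID off the I.Sicily record, mirroring the point that only one database lists all of them in one place.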
Before any database existed, this inscription had been printed in two great 19th-century corpora. Each editor made different choices. The five modern databases inherited some of those choices and silently dropped others.
Both editors silently normalized the archaic spelling QVM on the stone to CVM. EDCS, EDR, and PHI 140601 all inherit this 19th-century fix. Only I.Sicily's <choice> markup keeps both forms.
Every SDAM ETL pipeline mixes both formats, Jupyter notebooks (.ipynb) and R Markdown (.Rmd). The trick is knowing what each is good for.
.ipynb: code, prose, output, charts, citations, all in one browser-based file. Each cell runs on demand; outputs save inline. SDAM uses it for extraction (Python, calling APIs and parsing XML).
Why it matters: the JDH paper's seven figures live inside one notebook. Reopen it, re-run it, and get the same charts. Reproducibility ships with the document.
.Rmd: R, a language built for data analysis since 1993. Its tidyverse packages (dplyr, stringr, ggplot2) offer a clean syntax for operations on tables. SDAM uses it for cleaning (the Transform stage: coercing dates, normalizing text, harmonizing categories).
Why R for cleaning? R's pipe syntax (|>) chains regex substitutions cleanly: strip brackets → strip parens → collapse whitespace → done. The EDH 1_5 notebook is the showcase.
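The same three-step chain can be written in Python for comparison, with each substitution applied in order like a stage in a pipe. This is a sketch, not the EDH notebook's rules: it assumes "strip" means dropping the bracket characters while keeping the letters inside them, and the sample text is invented.

```python
import re
from functools import reduce

STEPS = [
    (re.compile(r"[\[\]]"), ""),  # strip square brackets, keep the restored letters
    (re.compile(r"[()]"), ""),    # strip parentheses, keep the expanded abbreviation
    (re.compile(r"\s+"), " "),    # collapse runs of whitespace
]


def clean_text(text: str) -> str:
    """Apply each substitution in order, like stages in a pipe."""
    return reduce(lambda s, step: step[0].sub(step[1], s), STEPS, text).strip()


print(clean_text("D(is)  M(anibus) [s]acrum"))  # Dis Manibus sacrum
```

Keeping the steps in an ordered list makes each cleaning rule visible and easy to reorder or remove, which matters when every rule is a scholarly choice.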
Letter heights from the I.Sicily TEI: line 1 = 22 mm, line 2 = 20 mm, line 3 = 8 mm, lines 4–7 = 10 mm. The dramatic drop from line 2 to line 3 is itself a paleographic signal, but only this kind of zone-anchored facsimile can encode it.
Press the button to feed all five database records into a naïve deduplication pipeline. Watch what breaks.
This is exactly the kind of mess the JDH paper's deduplication step was designed to absorb, and exactly why it took half the article to document.
No clone, no Docker, no scraper. The finished datasets are available at a public URL.
Click any of these to download the cleaned dataset:
If you have Python:
Or for Greek texts (LAGT):
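Either dataset can be read the same way. A minimal sketch, assuming the file is published as JSON at a public URL: the address below is a placeholder rather than a real dataset location, and the helper names are invented for the example.

```python
import json
from urllib.request import urlopen

# Placeholder only: substitute the public address of the dataset you want.
DATASET_URL = "https://example.org/path/to/dataset.json"


def load_records(url: str) -> list:
    """Download a JSON file and return its records as a list of dicts."""
    with urlopen(url) as response:
        return json.load(response)


def count_by(records: list, key: str) -> dict:
    """Tally records by the value of one attribute (missing values count as None)."""
    counts = {}
    for record in records:
        counts[record.get(key)] = counts.get(record.get(key), 0) + 1
    return counts
```

Once a real address is filled in, `count_by(load_records(DATASET_URL), "type")` would tally records by a `type` attribute, assuming the file has one. If a dataset is published in another format, such as parquet, a dataframe library would be needed instead.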
Extract: APIs, scrapers, XML, CSVs
Transform: clean, dedupe, classify
Load: sciencedata.dk + Zenodo
Plus tempun for uncertainty, 9+ analysis repos, two helper packages, six ETL pipelines, and 37 repositories total.