SDAM ETL · Visual

From Stone to Data

A visual tour of the SDAM ETL pipelines — explained without jargon.

Why this exists
The Romans (and Greeks) wrote on stone.

Tombstones, altars, milestones, dedications — over 600,000 inscriptions survive from the Mediterranean world between 800 BCE and 800 CE. To study them at scale, scholars first need them as data.

~600k inscriptions
~1,600 years: temporal range
3 continents: geographic spread
37 repos: SDAM open codebase

SDAM (Aarhus University) builds the open pipelines, packages, and analyses that turn these inscriptions and texts into clean, citable, reproducible datasets.

A quick lineage: Latin / Greek epigraphy as a discipline begins with Mommsen's CIL (1853–) and Boeckh's CIG (1828–), the great Berlin print corpora. The digital turn began with PHI CD-ROMs (1985), grew via web databases (1995–), settled on TEI EpiDoc (2005–), and is now shifting toward open data and reproducibility (2015–). The SDAM ETL pipelines you'll see in this tour belong to that fourth wave. Full lineage in Paper § 1.5.

The big idea
Three steps. Three letters. E · T · L.

Click each letter for a plain-language definition and the actual SDAM notebooks that do the work.

Each one is a stage in the journey from messy source material to a clean dataset.

Meet the team
Three core pipelines for Latin inscriptions.

Two extract from different sources. The third combines them. Click any card for details.

EDH_ETL
Heidelberg
Pulls inscriptions from the EDH via its public API + EpiDoc XML.
≈ 81,883 inscriptions

EDCS_ETL
Clauss/Slaby
Scrapes the EDCS website province by province (no API).
≈ 537,286 inscriptions

LI_ETL
Combined corpus
Merges both, dedupes, harmonizes types via machine learning.
→ LIST & LIRE

There are also pipelines for Greek inscriptions, Greek texts, and Bulgarian burial mounds — coming up.

Try it yourself
Build the pipeline in the right order.

Click each block in turn — left (the source) to right (researchers). Each correct placement reveals what that step actually does.

Five blocks, one correct order.

Stage 1: Extract
Go and get the raw stuff.

"Extract" means visiting wherever the data lives and bringing back a copy, without changing anything yet.

Imagine going to a library and photocopying every page in a section. The cleanup happens later. SDAM stores the raw copy as a separate, named file (e.g. JSON) so the next stage can re-read it without re-extracting.
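
The "save the raw copy once, re-read it later" idea can be sketched in a few lines of Python. This is only an illustration, not SDAM's code: the function name `load_or_extract`, the file name, and the fake extractor are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def load_or_extract(raw_path: Path, extract: Callable[[], List[Dict]]) -> List[Dict]:
    """Return the saved raw copy if it exists; otherwise extract once and save it."""
    if raw_path.exists():
        return json.loads(raw_path.read_text(encoding="utf-8"))
    records = extract()
    raw_path.write_text(json.dumps(records, ensure_ascii=False), encoding="utf-8")
    return records


extract_calls = []


def fake_extract() -> List[Dict]:
    """Hypothetical stand-in for a slow API call or scrape."""
    extract_calls.append(1)
    return [{"id": "HD012345", "place": " Carthago "}]  # raw value, untouched


with tempfile.TemporaryDirectory() as tmp:
    raw_file = Path(tmp) / "raw_extract.json"           # hypothetical file name
    first = load_or_extract(raw_file, fake_extract)     # extracts and writes the file
    second = load_or_extract(raw_file, fake_extract)    # only re-reads the file

print(len(extract_calls), first == second)
```

The extractor runs once; the second call is served from the file, with the messy whitespace still in place for the Transform stage to deal with.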

Goal

Make a local copy of everything the source has, in whatever format it provides.

EDH way
Like a vending machine.

Heidelberg offers an API — you ask their database, it sends back the answer. Press the button.

EDH database: edh.ub.uni-heidelberg.de

What just happened

You triggered an API request against the EDH endpoint /data/api/inscriptions/search. In real life this paginates 200 records per page over ~410 calls and takes about 12 minutes. The output is one JSON file with every inscription in EDH.

The actual code lives in the EDH_ETL repo, notebook 1_1_py_EXTRACTION_edh-inscriptions-from-web-api.ipynb.
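
For a feel of what "paginate 200 records per page" means, here is a minimal sketch. It is not the notebook's code: the real request is replaced by a fake page function so the example runs offline, and the offset/limit style of paging is an assumption about how such an API is typically driven.

```python
from typing import Callable, Dict, Iterator, List

# Endpoint path quoted in the tour; how it is queried is NOT shown here.
SEARCH_PATH = "/data/api/inscriptions/search"
PAGE_SIZE = 200  # records per page, as stated in the tour


def fetch_all(fetch_page: Callable[[int, int], List[Dict]],
              page_size: int = PAGE_SIZE) -> Iterator[Dict]:
    """Yield every record by requesting consecutive pages until one comes back short."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:   # a short (or empty) page means we reached the end
            break
        offset += page_size


def fake_fetch_page(offset: int, limit: int, total: int = 450) -> List[Dict]:
    """Hypothetical stand-in for one HTTP call: serves `total` synthetic records."""
    return [{"id": f"HD{n:06d}"} for n in range(offset, min(offset + limit, total))]


records = list(fetch_all(fake_fetch_page))
print(len(records))
```

With 81,883 records and 200 per page, the same loop would need 410 page requests, which matches the figure quoted above.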

EDCS way
Like reading the whole library by hand.

EDCS has no vending machine. A scraper visits every Roman province's page, one at a time. ~4–5 hours.

What just happened

You ran a simulated web scrape across 18 Roman provinces. The real scraper is Lat Epig 2.0, a Docker tool from Macquarie University that wraps EDCS's public search interface and saves one TSV per province. A full scrape takes 4–5 hours and produces ~537,000 cleaned Latin inscriptions.

SDAM doesn't write the scraper — they consume its output. The instructions for running it (clone repo, switch branch, bash dockerScraperAll.sh) live in the EDCS_ETL README.
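
Consuming "one TSV per province" could look roughly like the sketch below. The column names (`id`, `text`), the sample rows, and the helper names are invented for illustration; the real layout of the Lat Epig output is not shown in this tour.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO, Tuple


def read_province(handle: TextIO, province: str) -> List[Dict[str, str]]:
    """Parse one tab-separated file and tag each row with its province."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [dict(row, province=province) for row in reader]


def combine(files: Iterable[Tuple[str, TextIO]]) -> List[Dict[str, str]]:
    """Stack the rows of all per-province files into one list."""
    rows: List[Dict[str, str]] = []
    for province, handle in files:
        rows.extend(read_province(handle, province))
    return rows


# In-memory stand-ins for two per-province TSV files (invented rows).
africa = io.StringIO("id\ttext\nEDCS-00000001\tD M\nEDCS-00000002\tvixit annos XX\n")
dalmatia = io.StringIO("id\ttext\nEDCS-00000003\tI O M\n")

corpus = combine([("Africa proconsularis", africa), ("Dalmatia", dalmatia)])
print(len(corpus), corpus[0]["province"])
```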

Stage 2: Transform
Clean it. Make it consistent.

Raw data is messy. One inscription says "AD 100", another "100 CE", a third "around the 2nd century". Are they the same? The transform stage decides.

This is the most intellectual stage. Every cleaning rule is a scholarly choice — and SDAM keeps the original alongside the cleaned version so you can always see what changed.
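
To make the "AD 100" versus "100 CE" versus "2nd century" question concrete, here is a small sketch that maps such strings to a (not_before, not_after) pair. The rules are illustrative assumptions, not SDAM's actual cleaning rules: it only handles CE dates, and it treats "the 2nd century" as the years 101 to 200.

```python
import re
from typing import Optional, Tuple

ORDINAL_CENTURY = re.compile(r"(\d+)(?:st|nd|rd|th)\s+c(?:entury|\.)", re.IGNORECASE)
YEAR_WITH_ERA = re.compile(r"(?:AD\s*(\d+)|(\d+)\s*(?:CE|AD))", re.IGNORECASE)


def to_range(raw: str) -> Optional[Tuple[int, int]]:
    """Map a free-text CE date to (not_before, not_after); None if unrecognized."""
    text = raw.strip()
    century = ORDINAL_CENTURY.search(text)
    if century:
        n = int(century.group(1))
        return ((n - 1) * 100 + 1, n * 100)   # assumed convention: 2nd century = 101..200
    year = YEAR_WITH_ERA.search(text)
    if year:
        value = int(year.group(1) or year.group(2))
        return (value, value)
    return None                                # leave unknown formats for a human to review


for raw in ["AD 100", "100 CE", "around the 2nd century"]:
    print(repr(raw), "->", to_range(raw))
```

Under these rules the first two strings become the same value and the third becomes a range. A different rule set would give different answers, which is exactly why each rule counts as a scholarly choice.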

Goal

Turn many inconsistent records into one consistent table that a computer (and a historian) can analyze.

Try the cleaning
Click each chip to apply a fix.

id: HD012345
date: "around the middle of the 2nd c. CE"
place: " Carthago "
type: "sepulcralis"
text: D(is) [M(anibus)] / Iuliae [- - -] / vix(it) ann(os) XX

When two databases overlap
The same inscription, in two places.

Many inscriptions appear in both EDH and EDCS. The LI pipeline finds duplicates and merges them — using a machine-learning classifier to harmonize incompatible inscription-type taxonomies.

From EDH
id: HD012345
date: 130–170 CE
place: Carthage
type: epitaph

From EDCS
id: EDCS-50001234
date: ca. 150 CE
place: Carthago
type: sepulcralis

What just happened

Two records — one EDH, one EDCS — were matched by CIL/AE reference, geographic proximity, date overlap, and text similarity. Their attributes were merged column-by-column. The inscription type was unified by an ML classifier trained on overlap inscriptions; the EDH-style label "epitaph" replaced EDCS's Latin "sepulcralis", with confidence p=0.94.
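
A toy version of the matching step might look like the following. It is a sketch under loud assumptions: the `Record` fields, the 0.8 similarity threshold, and the sample texts and dates are invented, geographic proximity is left out, and the similarity measure is Python's standard `difflib`, not whatever the LI pipeline actually uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Tuple


@dataclass
class Record:
    id: str
    date_range: Tuple[int, int]        # (not_before, not_after), years CE
    text: str
    corpus_ref: Optional[str] = None   # e.g. a CIL or AE number, when known


def dates_overlap(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """True when two inclusive year ranges share at least one year."""
    return a[0] <= b[1] and b[0] <= a[1]


def text_similarity(a: str, b: str) -> float:
    """Similarity between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_probable_duplicate(a: Record, b: Record, min_similarity: float = 0.8) -> bool:
    """A shared print reference settles it; otherwise require date overlap and similar text."""
    if a.corpus_ref and a.corpus_ref == b.corpus_ref:
        return True
    return (dates_overlap(a.date_range, b.date_range)
            and text_similarity(a.text, b.text) >= min_similarity)


# Invented sample values, loosely modeled on the two cards above.
edh = Record("HD012345", (130, 170), "Dis Manibus Iuliae vixit annos XX")
edcs = Record("EDCS-50001234", (150, 150), "Dis Manibus Iuliae vixit annos XX")
other = Record("EDCS-59999999", (300, 350), "Iovi Optimo Maximo sacrum")

print(is_probable_duplicate(edh, edcs), is_probable_duplicate(edh, other))
```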

The result is one row in the merged corpus called LIST (Latin Inscriptions in Space and Time). A spatio-temporally restricted subset called LIRE (Latin Inscriptions of the Roman Empire) is published separately.

Stage 3: Load
Put it where people can find it.

Every clean dataset goes to two places. Click each for details.

sciencedata.dk
working storage
The team's shared cloud drive. Anyone can read the public folder.
"Like a shared Google Drive: easy to update, easy to read."

Zenodo
permanent archive
Each release gets a DOI, a permanent identifier that resolves to its web address.
"Like a book with an ISBN: frozen, dated, citable forever."

The full journey
Watch one inscription travel through.

A 2nd-century stele from Africa Proconsularis. Press play.

EXTRACT → TRANSFORM → LOAD

Beyond ETL: tempun
What if the date is fuzzy?

Most inscriptions aren't dated to a year, only to a range. Naive midpoint dating creates fake spikes. tempun uses Monte Carlo simulation to spread probability honestly across each range, so the uncertainty is propagated into the analysis instead of hidden, in the spirit of the FAIR data principles.
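
The idea can be sketched with nothing but Python's standard library. This is not tempun's API: the function name, the 25-year bins, and the choice of a uniform distribution within each range are assumptions made for the example.

```python
import random
from collections import Counter
from typing import List, Tuple


def simulate_counts(ranges: List[Tuple[int, int]], bin_width: int = 25,
                    n_runs: int = 1000, seed: int = 0) -> Counter:
    """Average number of inscriptions per time bin over many random draws."""
    rng = random.Random(seed)
    totals: Counter = Counter()
    for _ in range(n_runs):
        for start, end in ranges:
            year = rng.randint(start, end)                  # one plausible year per inscription
            totals[(year // bin_width) * bin_width] += 1    # label each bin by its first year
    return Counter({b: n / n_runs for b, n in totals.items()})


# Ten inscriptions, each dated only to "100-200 CE".
ranges = [(100, 200)] * 10
averaged = simulate_counts(ranges)

for bin_start in sorted(averaged):
    print(bin_start, round(averaged[bin_start], 2))
```

Midpoint dating would stack all ten inscriptions at year 150. The simulation spreads them across the century instead, and the bin averages still add up to ten inscriptions.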

Naive midpoint chart (100–200 CE): a single tall bar at year 150. Looks confident, but too confident.

Beyond Latin inscriptions
The same idea, applied to six corpora.

Click any card to see what it covers, where to download it, and what makes it different.

EDH
Latin · inscriptions · API
Heidelberg, accessed via API + EpiDoc XML.
~81,883

EDCS
Latin · inscriptions · scraper
Clauss/Slaby, scraped via Lat Epig.
~537,286

LI (LIST/LIRE)
Latin · merged · ML
EDH + EDCS deduped, types harmonized.
parquet

GI_ETL (PHI)
Greek · inscriptions · CSVs
PHI Greek inscriptions, enriched via Trismegistos.
~200k+

LAGT
Greek · texts · lemmatized
1,958 Greek works, 35M tokens, lemmatized.
v5.1 · parquet

mounds_ETL
Bulgaria · burial mounds
Same ETL discipline, applied to archaeology.
archaeology

What gets built on top
The cleaned data is just the start.

Every research project is its own open repo. Click any card to open its GitHub.

The "epigraphic habit"
Macro-historical analysis: rise and fall of inscription production.
digital_epigraphy · JDH 2021

Roads from inscriptions
Reconstructing ancient road networks from inscription distribution.
epigraphic_roads

Epigraphic formulae
Quantitative analysis of recurring phrasing (D.M., vixit annis).
formulae

NLP on inscriptions
NLP experiments on inscription text. Connected Past 2021.
NLP_inscriptions

Roman labor diversity
Division of labor and occupational specialization across Roman cities.
social_diversity

Coins
Numismatic exploration applying SDAM data discipline.
coins

Landscape prominence
R functions calculating site visibility: the Roman field of view.
landscape_prominence

Foot travel
Mediterranean-scale pedestrian travel times between settlements.
landscape-travel

OCR + cleaning
Character recognition and pre-cleaning of inscription texts.
OCR · epigraphic_cleaning

The connector packages
Two libraries make all of this easy to use.

Wrappers around the ugly parts (auth, file paths, dating math) so notebooks stay short.

sddk
Python · MIT · PyPI
pip install sddk
One-line access to sciencedata.dk: read and write JSON/CSV/parquet from a Danish national research-data folder as if it were local.

sdam (R)
R · CC-BY · CRAN
install.packages("sdam")

R toolkit with a built-in EDH dataset, place-name maps, and a prex() function that computes a "probability of existence" by binning any inscription's date range into time periods.
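
The binning idea can be illustrated in Python (the language of the other sketches in this tour). This is not the prex() function and does not mirror its arguments: it is a simplified, deterministic stand-in that assumes every year in a range is equally likely and reports what share of the range falls into each period.

```python
from typing import Dict, List, Tuple

Period = Tuple[int, int]


def bin_shares(date_range: Period, bins: List[Period]) -> Dict[Period, float]:
    """Share of an inclusive date range that falls inside each inclusive bin."""
    start, end = date_range
    span = end - start + 1
    shares: Dict[Period, float] = {}
    for b_start, b_end in bins:
        overlap = min(end, b_end) - max(start, b_start) + 1   # years shared with this bin
        shares[(b_start, b_end)] = max(overlap, 0) / span
    return shares


# The EDH example record above is dated 130-170 CE; the two periods are hypothetical.
shares = bin_shares((130, 170), [(101, 150), (151, 200)])
for period, share in shares.items():
    print(period, round(share, 3))
```

Here the 41-year range puts 21 years in the first period and 20 in the second, so the inscription counts as roughly half in each instead of wholly in one.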

What gets published
The flagship paper, three layers.

Heřmánková, Kaše & Sobotková (2021) appeared in the inaugural issue of the Journal of Digital History, which uses a deliberate three-layer publication format. The paper revisits the epigraphic habit thesis at full corpus scale. Click each layer.

Layer 1 · Narrative
The article you read
Prose, figures, citations: what humanists recognize as a journal article.

Layer 2 · Hermeneutic
Methodological reflection
A parallel commentary on how the work was done: choices and rationales.

Layer 3 · Data
Executable Jupyter notebooks
The actual code that produced every figure. Re-run, change a parameter, get a different chart.

The article in 7 charts
Seven figures, one argument.

Click any tile to see how that figure shapes the paper's claim.

Each figure is reproduced and explained in the Paper Edition. The actual PNGs are in the digital_epigraphy repo, generated by the companion notebook.

Now zoom in
One stone, five databases.

A bilingual marble plaque from Roman Sicily — ISic000470, a stonecutter's shop sign in Greek and archaic Latin — appears across five major epigraphic databases. Each says something different about it.

The same stone has eleven names.

Four database IDs (one repeated), one DOI, six print-corpus references, two epigraphic-bulletin clusters. The pulsing chips are the same physical inscription listed twice in PHI.

I.Sicily ISic000470 · EDR 140617 · EDCS 22000882 · PHI 175744 · PHI 140601 · TM 491798 · DOI zenodo.4337543 · CIL X 7296 · IG XIV 297 · ILS 7680 · IGR 1.503 · CIG 5554 · IGLPalermo 139 · ILMusPalermo 74 · AE 2000.643 · SEG 39.1017

Without a multi-way crosswalk, automated dedup over-counts this single inscription up to 5×. Only I.Sicily's TEI publicationStmt records all five external IDs in one place.
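
A crosswalk is, at its simplest, a lookup table from every known ID to one canonical ID. The sketch below uses the database IDs listed above; the dictionary layout and the function names are illustrative assumptions, not I.Sicily's or SDAM's data model.

```python
from typing import Dict, List

# IDs listed in the tour for the same stone, all pointing at the I.Sicily ID.
CROSSWALK: Dict[str, str] = {
    "EDR 140617": "ISic000470",
    "EDCS 22000882": "ISic000470",
    "PHI 175744": "ISic000470",
    "PHI 140601": "ISic000470",
}


def canonical_id(record_id: str) -> str:
    """Resolve a database-specific ID to its canonical ID (itself if unknown)."""
    return CROSSWALK.get(record_id, record_id)


def count_unique(record_ids: List[str]) -> int:
    """Count physical inscriptions, not database rows."""
    return len({canonical_id(r) for r in record_ids})


harvested = ["ISic000470", "EDR 140617", "EDCS 22000882", "PHI 175744", "PHI 140601"]
print(len(harvested), count_unique(harvested))
```

Five harvested rows collapse to one stone once every ID resolves to the same canonical entry; without the table, each row would be counted separately.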

Eight transcriptions deep
The same stone in 1883 vs 1890.

Before any database existed, this inscription had been printed in two great 19th-century corpora. Each editor made different choices. The five modern databases inherited some of those choices and silently dropped others.

7296 CIL X · Mommsen 1883
originis incertae, sed vix urbanae; immo Siculam originem prodit quod bilinguis est, cum litterae sint optimae aetatis marmorarium commendantes…
ϹΤΗΛΑΙ
ΕΝΘΑΔΕ
ΤΥΠΟΥΝΤΑΙ ΚΑΙ
ΧΑΡΑϹϹΟΝΤΑΙ
ΝΑΟΙϹ ΙΕΡΟΙϹ
ϹΥΝ ΕΝΕΡΓΕΙΑΙϹ
ΔΗΜΟϹΙΑΙϹ
TITVLI
HEIC
ORDINANTVR ET
SCVLPVNTVR
AIDIBVS SACREIS
CVM OPERVM
PVBLICORVM
Marmorarius hic utriusque linguae infantiam prae se fert…
297 IG XIV · Kaibel 1890
Panormi in museo universitatis TORR. Lapis…est originis incertae; 'Siculam originem prodit, quod bilinguis est' MOMMSEN.
ϹΤΗΛΑΙ
ΕΝΘΑΔΕ
ΤΥΠΟΥΝΤΑΙΚΑΙ
ΧΑΡΑϹϹΟΝΤΑΙ
ΝΑΟΙϹΙΕΡΟΙϹ
ϹΥΝΕΝΕΡΓΕΙΑΙϹ
ΔΗΜΟϹΙΑΙϹ
TITVLI
HEIC
ORDINANTVRET
SCVLPVNTVR
AIDIBVSSACREIS
CVMOPERVM
PVBLICORVM
Στῆλαι ἐνϑάδε τυποῦνται καὶ χαράσσονται ναοῖς ἱεροῖς σὺν ἐνεργείαις δημοσίαις.
Marmorarius nec Graecus opinor nec Romanus homo…
The "QVM → CVM" silent fix.
Both editors normalize the archaic QVM on the stone to CVM. EDCS, EDR, and PHI 140601 all inherit this 19th-c. fix. Only I.Sicily's <choice> markup keeps both forms.

Kaibel's normalized Greek line.
The single-line accented Greek (Στῆλαι ἐνθάδε…) is Kaibel's 1890 invention, and it is the canonical text PHI uses today. PHI's "original" is four iterations away from the stone.

Two languages, one pipeline
What's a Jupyter notebook? What's R?

Every SDAM ETL pipeline mixes both. The trick is knowing what each is good for.

Jupyter Notebook · .ipynb

A document that runs.

Code, prose, output, charts, citations — all in one browser-based file. Each cell runs on demand; outputs save inline. SDAM uses it for extraction (Python, calling APIs and parsing XML).

Why it matters: the JDH paper's seven figures live inside one notebook — reopen it, re-run, get the same charts. Reproducibility ships with the document.

R · .Rmd

A statistician's lingua franca.

A language built for data analysis, first released in 1993. Its tidyverse packages (dplyr, stringr, ggplot2) offer a clean syntax for table operations: filtering rows, rewriting columns, plotting results. SDAM uses it for cleaning (the Transform stage: coercing dates, normalizing text, harmonizing categories).

Why R for cleaning? R's pipe syntax (|>) chains regex substitutions cleanly: strip brackets → strip parens → collapse whitespace → done. The EDH 1_5 notebook is the showcase.
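
The same chain of substitutions can be sketched in Python, to match the other examples in this tour. This is not the code of the 1_5 notebook. The choices made here (keep restored and expanded letters, drop lacuna markers, treat "/" line dividers as spaces) are assumptions for illustration; a more conservative cleaning would make different ones.

```python
import re


def clean_text(raw: str) -> str:
    """Apply a fixed chain of regex substitutions, one small step at a time."""
    steps = [
        (r"\[[\s\-]*\]", " "),   # drop lacuna markers such as [- - -]
        (r"[\[\]]", ""),          # strip square brackets, keep the restored letters
        (r"[()]", ""),            # strip parentheses, keep the expanded abbreviations
        (r"/", " "),              # treat line dividers as spaces
        (r"\s+", " "),            # collapse runs of whitespace
    ]
    text = raw
    for pattern, replacement in steps:
        text = re.sub(pattern, replacement, text)
    return text.strip()


raw = "D(is) [M(anibus)] / Iuliae [- - -] / vix(it) ann(os) XX"
print(clean_text(raw))
```

Applied to the sample record from the cleaning demo, the chain yields `Dis Manibus Iuliae vixit annos XX`.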

Click a line, watch the zone light up.

Letter heights from the I.Sicily TEI: line 1 = 22 mm, line 2 = 20 mm, line 3 = 8 mm, lines 4–7 = 10 mm. The dramatic drop from line 2 to line 3 is itself a paleographic signal — but only this kind of zone-anchored facsimile can encode it.

Line 1 (22 mm): ϹΤΗΛΑΙ / TITVLI
Line 2 (20 mm): ΕΝΘΑΔΕ / HEIC
Line 3 (8 mm): ΤΥΠΟΥΝΤΑΙ ΚΑΙ / ORDINANTVR ET
Line 4 (10 mm): ΧΑΡΑϹϹΟΝΤΑΙ / SCVLPVNTVR
Line 5 (10 mm): ΝΑΟΙϹ ΙΕΡΟΙϹ / AIDIBVS SACREIS
Line 6 (10 mm): ϹΥΝ ΕΝΕΡΓΕΙΑΙϹ / QVM OPERVM
Line 7 (10 mm): ΔΗΜΟϹΙΑΙϹ / PVBLICORVM

When you try to merge five records, conflicts erupt.

Press the button to feed all five database records into a naïve deduplication pipeline. Watch what breaks.

I.Sicily
date: 1–200 CE
inv: 3574
w: 24.5
EDR
date: -100–100
inv: 8822
w: 14.5
EDCS
date: ?
inv: —
w: —
PHI 175744
date: 100–200
L12: qum
PHI 140601
date: undated
L12: cum
CONFLICT · date: 5 candidates spanning 100 BCE → 200 CE
CONFLICT · inventory: 3574 ≠ 8822
CONFLICT · width: 24.5 ≠ 14.5 cm
CONFLICT · transcription line 12: qum vs cum
WARNING · 2 PHI records resolve to same TM 491798
NOTICE · EDH absent — out of scope

This is exactly the kind of mess the JDH paper's deduplication step was designed to absorb — and exactly why it took half the article to document.
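
A naive conflict check over the five records above can be sketched as follows. The field values are copied from the cards; treating a dash as "no value" (None) while keeping "?" and "undated" as competing date statements is an assumption made so the sketch mirrors the conflict list, and `find_conflicts` is a hypothetical helper, not the paper's deduplication code.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]

records: Dict[str, Row] = {
    "I.Sicily":   {"date": "1–200 CE", "inv": "3574", "w": "24.5"},
    "EDR":        {"date": "-100–100", "inv": "8822", "w": "14.5"},
    "EDCS":       {"date": "?", "inv": None, "w": None},
    "PHI 175744": {"date": "100–200", "line12": "qum"},
    "PHI 140601": {"date": "undated", "line12": "cum"},
}


def find_conflicts(rows: Dict[str, Row]) -> Dict[str, List[str]]:
    """For each field, list the distinct values when the sources disagree."""
    conflicts: Dict[str, List[str]] = {}
    fields = sorted({field for row in rows.values() for field in row})
    for field in fields:
        values = sorted({row[field] for row in rows.values()
                         if row.get(field) is not None})
        if len(values) > 1:            # more than one distinct statement = conflict
            conflicts[field] = values
    return conflicts


for field, values in find_conflicts(records).items():
    print(f"CONFLICT · {field}: {values}")
```

The sketch only detects disagreement. Deciding which value wins is the scholarly part, and no amount of code settles that automatically.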

Hands-on
Use the data in 60 seconds.

No clone, no Docker, no scraper. The finished datasets are on a public URL.

The no-code way

Click any of these to download the cleaned dataset:

The two-line code way

If you have Python:

import pandas as pd

# 81,883 Latin inscriptions, ready
EDH = pd.read_json("https://sciencedata.dk/public/b6b6afdb969d378b70929e86e58ad975/EDH_text_cleaned_2022_11_03.json")

Or for Greek texts (LAGT):

import pandas as pd

LAGT = pd.read_parquet("https://zenodo.org/records/13889714/files/LAGT_v4-1.parquet?download=1")

Wrap-up
That's the SDAM ecosystem.

E · Extract
APIs, scrapers, XML, CSVs

T · Transform
Clean, dedupe, classify

L · Load
sciencedata.dk + Zenodo

Plus tempun for uncertainty, 9+ analysis repos, two helper packages, six ETL pipelines, and 37 repositories total.
