14 changes: 14 additions & 0 deletions README.md
@@ -6,6 +6,20 @@ You can download the latest SQLite release of the [China Biographical Database](

Check [**latest.json**](https://github.com/cbdb-project/cbdb_sqlite/blob/master/latest.json) for the current release date, filename, SHA-256 checksum, and direct download URL.
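
Since `latest.json` publishes a SHA-256 checksum, a download can be checked before use. The following is a minimal sketch, not part of the repository: the function names are illustrative, and the manifest field name `sha256` is an assumption, so adjust `key` to whatever `latest.json` actually uses.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_manifest(archive_path, manifest_path, key="sha256"):
    """Compare a file's digest with the checksum stored in a JSON manifest.

    `key` is a hypothetical field name, not confirmed by the repository.
    """
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    return sha256_of(archive_path) == str(manifest[key]).lower()
```

Reading in chunks keeps memory use flat even for a large archive.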

## Post-processing (optional)

The raw database export does not include foreign key constraints, convenience views, or the denormalised `ADDRESSES` table.
Use the scripts in [`scripts/`](./scripts/) to add them, or run the one-click Colab notebook:

| What you want | How to get it |
|---------------|---------------|
| Everything in one click | Open [`scripts/setup_cbdb.ipynb`](./scripts/setup_cbdb.ipynb) in Google Colab |
| Foreign key constraints | `python scripts/add_foreign_keys.py --db latest.db` |
| 18 convenience views | `bash scripts/create_views.sh latest.db` |
| `ADDRESSES` hierarchy table | `python scripts/create_addresses_table.py --db latest.db` |

See [`scripts/README.md`](./scripts/README.md) for full documentation.
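
To see which of these optional steps have already been applied to a database file, the SQLite catalogue can be queried directly. This is a rough sketch that uses only Python's standard `sqlite3` module; the function name and the shape of the returned dictionary are illustrative and not part of the repository's scripts.

```python
import sqlite3


def postprocessing_status(db_path):
    """Report which optional post-processing artefacts exist in a database."""
    con = sqlite3.connect(db_path)
    try:
        views = [row[0] for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'view' ORDER BY name")]
        tables = [row[0] for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        tables_with_fk = []
        for table in tables:
            quoted = '"' + table.replace('"', '""') + '"'
            # PRAGMA foreign_key_list yields one row per foreign-key column.
            if con.execute(f"PRAGMA foreign_key_list({quoted})").fetchall():
                tables_with_fk.append(table)
    finally:
        con.close()
    return {
        "view_count": len(views),
        "has_addresses_table": "ADDRESSES" in tables,
        "tables_with_foreign_keys": tables_with_fk,
    }
```

On a raw export one would expect no views, no `ADDRESSES` table, and an empty list of tables with foreign keys.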

## Data Limitations

* The ZZZ releases are now deprecated in favor of views. Use [`create_views.sh`](./scripts/create_views.sh) to create views in the SQLite file.
76 changes: 63 additions & 13 deletions scripts/README.md
@@ -2,26 +2,76 @@

[中文文档](./README.zh.md)

This directory contains the helper scripts used to download, normalise, and compare CBDB SQLite releases. The scripts themselves live in the project root so they can be executed directly without adjusting `PATH`; refer to their relative locations when running the commands below.
This directory contains helper scripts for downloading, post-processing, and comparing CBDB SQLite releases.

## Available Scripts

- `process_cbdb_dbs.sh`: end-to-end workflow that downloads the latest and historical SQLite dumps, unpacks them, applies the normalisation helpers, vacuums the databases, and generates a schema/data summary comparison.
- `compare_db_tables.py`: compares two SQLite databases table-by-table, emitting a report of schema and data discrepancies.
### One-stop notebook

- **`setup_cbdb.ipynb`** — Google Colab notebook that runs the full setup pipeline in one click:
downloads the latest database, adds foreign keys, creates views, and builds the `ADDRESSES` table.
Upload to [Google Colab](https://colab.research.google.com/) and click **Runtime → Run all**.
Each step can be toggled on or off via boolean flags in the *Configuration* cell.

### Individual scripts

| Script | Description |
|--------|-------------|
| `add_foreign_keys.py` | Fetches `foreign_keys_regen.csv` from GitHub and recreates SQLite tables with proper `FOREIGN KEY` constraints. Skips tables that already have FK constraints (idempotent). |
| `create_views.sh` | Creates 18 convenience SQL views (e.g. `View_PeopleData`, `View_EntryData`, `View_PostingOfficeData`). |
| `create_addresses_table.py` | Builds the `ADDRESSES` table by resolving the full administrative hierarchy for each address across time, preserving gaps in the data. |
| `compare_db_tables.py` | Compares two SQLite databases table-by-table, emitting row-count and schema discrepancies. |
| `process_cbdb_dbs.sh` | End-to-end workflow: downloads the latest and a historical SQLite dump, unpacks them, vacuums both, and runs `compare_db_tables.py`. |

## Prerequisites

The scripts expect the following command line tools:
### For the Colab notebook (`setup_cbdb.ipynb`)

No local installation needed — just upload to Google Colab.

### For running scripts locally

| Tool | Required by |
|------|-------------|
| `python3` | `add_foreign_keys.py`, `create_addresses_table.py`, `compare_db_tables.py` |
| `sqlite3` CLI | `create_views.sh` |
| `bash` | `create_views.sh`, `process_cbdb_dbs.sh` |
| `wget`, `7z` | `process_cbdb_dbs.sh` |

`process_cbdb_dbs.sh` checks for missing tools at startup and exits early if any are absent.
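
The same kind of check can be run by hand before starting. The sketch below is not the shell script's actual logic; it looks each tool up on `PATH` with `shutil.which`, and the default tool list is an assumption taken from the table above.

```python
import shutil


def missing_tools(tools=("bash", "python3", "sqlite3", "wget", "7z")):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        raise SystemExit("Missing required tools: " + ", ".join(missing))
    print("All required tools are available.")
```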

## Usage

### Add foreign keys

```bash
python scripts/add_foreign_keys.py --db latest.db
```

Pass `--csv-url URL` to fetch `foreign_keys_regen.csv` from a different branch.
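
SQLite cannot add a `FOREIGN KEY` constraint to an existing table with `ALTER TABLE`, which is why tables have to be recreated. The sketch below shows the generic rebuild pattern (create a new table, copy rows, drop the old table, rename) together with the skip-if-present check that makes a rerun harmless. It illustrates the idea only and is not the code in `add_foreign_keys.py`; the function name and the `__new` suffix are hypothetical.

```python
import sqlite3


def rebuild_with_foreign_key(con, table, new_create_sql):
    """Recreate `table` from `new_create_sql`, which must create "<table>__new".

    Returns False without touching anything if the table already has a
    foreign key, so calling it twice is safe.
    """
    if con.execute(f'PRAGMA foreign_key_list("{table}")').fetchall():
        return False
    # Keep enforcement off while the old table is dropped and replaced.
    con.execute("PRAGMA foreign_keys = OFF")
    with con:  # commits on success, rolls back on error
        con.execute(new_create_sql)
        # Assumes the new table declares its columns in the same order.
        con.execute(f'INSERT INTO "{table}__new" SELECT * FROM "{table}"')
        con.execute(f'DROP TABLE "{table}"')
        con.execute(f'ALTER TABLE "{table}__new" RENAME TO "{table}"')
    return True
```

Indexes, triggers, and views that depend on the old table are not handled here and would need to be recreated separately.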

### Create views

```bash
bash scripts/create_views.sh latest.db
```

### Build ADDRESSES table

```bash
python scripts/create_addresses_table.py --db latest.db
```
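
Resolving a full administrative hierarchy amounts to walking parent links until the top level is reached, which SQLite can express as a recursive common table expression. The sketch below uses a toy table, `addr_belongs(addr_id, parent_id)`, that is purely hypothetical and is not the CBDB schema. It also ignores the time dimension that `create_addresses_table.py` takes into account.

```python
import sqlite3

# Walk parent links upwards; `depth` counts the steps from the address.
HIERARCHY_SQL = """
WITH RECURSIVE chain(addr_id, ancestor_id, depth) AS (
    SELECT addr_id, parent_id, 1
    FROM addr_belongs
    WHERE parent_id IS NOT NULL
    UNION ALL
    SELECT chain.addr_id, b.parent_id, chain.depth + 1
    FROM chain
    JOIN addr_belongs AS b ON b.addr_id = chain.ancestor_id
    WHERE b.parent_id IS NOT NULL
)
SELECT addr_id, ancestor_id, depth FROM chain ORDER BY addr_id, depth
"""


def ancestors(con):
    """Map each address id to its ancestors, nearest first."""
    result = {}
    for addr_id, ancestor_id, _depth in con.execute(HIERARCHY_SQL):
        result.setdefault(addr_id, []).append(ancestor_id)
    return result
```

The query assumes the parent links contain no cycles; with `UNION ALL`, a cycle would make the recursion run without end.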

### Compare two releases

- `wget`
- `7z`
- `sqlite3`
- `python3`
```bash
python scripts/compare_db_tables.py old.db new.db
```
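
As a rough picture of what a table-by-table comparison involves, the sketch below collects each table's row count and `CREATE TABLE` text from `sqlite_master` and reports the differences. It is a simplified stand-in rather than the logic of `compare_db_tables.py`; the function names and report wording are illustrative.

```python
import sqlite3


def table_summary(db_path):
    """Return {table_name: (row_count, create_sql)} for user tables."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
        summary = {}
        for name, sql in rows:
            if name.startswith("sqlite_"):  # skip SQLite's internal tables
                continue
            quoted = '"' + name.replace('"', '""') + '"'
            count = con.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
            summary[name] = (count, sql)
        return summary
    finally:
        con.close()


def compare_databases(old_path, new_path):
    """List (table, message) pairs describing how two databases differ."""
    old, new = table_summary(old_path), table_summary(new_path)
    report = []
    for name in sorted(old.keys() | new.keys()):
        if name not in new:
            report.append((name, "only in old"))
        elif name not in old:
            report.append((name, "only in new"))
        else:
            if old[name][1] != new[name][1]:
                report.append((name, "schema differs"))
            if old[name][0] != new[name][0]:
                report.append(
                    (name, f"row count {old[name][0]} -> {new[name][0]}"))
    return report
```

Comparing the stored `CREATE TABLE` text is strict: a whitespace-only change in the definition is reported as a schema difference.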

Install the tools before running the scripts. `process_cbdb_dbs.sh` will perform a sanity check and exit early if any are missing.
### Download and compare historical releases

## Usage Notes
```bash
bash scripts/process_cbdb_dbs.sh
```

- Run the shell script from the repository root: `./process_cbdb_dbs.sh`.
- Both Python utilities accept `--help` for detailed argument listings.
- Intermediate downloads are written to a temporary directory and cleaned up automatically; resulting databases are created alongside the scripts.
Intermediate downloads are written to a temporary directory and cleaned up automatically.
75 changes: 62 additions & 13 deletions scripts/README.zh.md
@@ -1,25 +1,74 @@
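# CBDB Scripts Guide

This directory contains helper scripts for the CBDB project. The scripts themselves are kept in the repository root so that they can be executed directly. When running them, call the corresponding files by relative path as described below.
This directory contains helper scripts for downloading, post-processing, and comparing CBDB SQLite releases.

## Script Overview

- `process_cbdb_dbs.sh`: end-to-end workflow script that downloads the latest and historical SQLite databases, unpacks them, runs the normalisation tools, executes `VACUUM`, and generates a database difference report.
- `compare_db_tables.py`: compares the schema and data of two SQLite databases table by table and outputs a summary of the differences.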
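### One-click notebook

- **`setup_cbdb.ipynb`**: Google Colab notebook that runs the full setup workflow in one click. It downloads the latest database, adds foreign keys, creates views, and builds the `ADDRESSES` table.
Upload it to [Google Colab](https://colab.research.google.com/) and click **Runtime → Run all** to run it.
Each step can be switched on or off individually through boolean variables in the *Configuration* cell.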

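### Individual scripts

| Script | Description |
|--------|-------------|
| `add_foreign_keys.py` | Reads `foreign_keys_regen.csv` from GitHub, rebuilds the SQLite tables that lack foreign keys, and adds `FOREIGN KEY` constraints. Tables that already have foreign keys are skipped automatically (idempotent). |
| `create_views.sh` | Creates 18 SQL views for convenient querying (e.g. `View_PeopleData`, `View_EntryData`, `View_PostingOfficeData`). |
| `create_addresses_table.py` | Builds the `ADDRESSES` table by resolving the administrative hierarchy of each address across time periods, preserving gaps in the data. |
| `compare_db_tables.py` | Compares the row counts and schemas of two SQLite databases table by table and outputs a summary of the differences. |
| `process_cbdb_dbs.sh` | End-to-end workflow script: downloads the latest and one historical SQLite database, unpacks them, runs `VACUUM`, and calls `compare_db_tables.py` to generate a comparison report. |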

## Prerequisites

Make sure the following command-line tools are installed:
### Colab notebook (`setup_cbdb.ipynb`)

No local installation is needed; upload the notebook directly to Google Colab.

### Running scripts locally

| Tool | Required by |
|------|-------------|
| `python3` | `add_foreign_keys.py`, `create_addresses_table.py`, `compare_db_tables.py` |
| `sqlite3` CLI | `create_views.sh` |
| `bash` | `create_views.sh`, `process_cbdb_dbs.sh` |
| `wget`, `7z` | `process_cbdb_dbs.sh` |

`process_cbdb_dbs.sh` checks its dependencies at startup and exits with an error if any tool is missing.

## Usage

### Add foreign keys

```bash
python scripts/add_foreign_keys.py --db latest.db
```

Pass `--csv-url URL` to use the `foreign_keys_regen.csv` from a different branch.

### Create views

```bash
bash scripts/create_views.sh latest.db
```

### Build the ADDRESSES table

```bash
python scripts/create_addresses_table.py --db latest.db
```

### Compare two releases

- `wget`
- `7z`
- `sqlite3`
- `python3`
```bash
python scripts/compare_db_tables.py old.db new.db
```

`process_cbdb_dbs.sh` checks its dependencies at startup and exits with an error if any tool is missing.
### Download and compare historical releases

## Usage Notes
```bash
bash scripts/process_cbdb_dbs.sh
```

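- Run the script from the repository root: `./process_cbdb_dbs.sh`.
- Both Python tools accept `--help` to show their detailed arguments.
- The script creates a temporary directory for downloaded files and cleans it up on exit; the generated databases are placed in the directory that contains the scripts.
Downloaded files are written to a temporary directory and cleaned up automatically when the script finishes.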