Count tokenizer tokens in local files or Hugging Face datasets, then generate a Markdown/JSON report with the total token count and distribution insights.
The primary output is always:
- Markdown summary: `Total tokens`
- JSON field: `summary_stats.total_tokens`
- Terminal summary: `Total tokens`
```
pip install -r requirements.txt
pip install -e .
```

For private or gated Hugging Face datasets, set `HF_TOKEN` in the environment or in a local `.env` file:

```
HF_TOKEN=hf_xxx
```

Optional PDF support:

```
pip install -e ".[pdf]"
```

Run a small sample from a Hugging Face dataset:

```
token-counter \
  --dataset costadev00/gutenberg-project-tokenweaver-cpt-2048 \
  --config default \
  --split train \
  --field text \
  --model Qwen/Qwen3-1.7B-Base \
  --max-docs 100 \
  --report reports/gutenberg_sample.md \
  --report-json reports/gutenberg_sample.json
```

Single local Parquet file:
```
token-counter \
  --input data/dataset.parquet \
  --format parquet \
  --field text
```

Multiple local Parquet shards:

```
token-counter \
  --input "data/shards/*.parquet" \
  --format parquet \
  --field text
```

Use quotes around globs so the shell passes the pattern through to the CLI unexpanded.
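The quoting matters because an unquoted glob is expanded by the shell into many separate arguments before the CLI ever runs, while a quoted pattern arrives as a single string the tool can expand itself. A minimal sketch of that in-tool expansion (an illustration, not the tool's actual code):

```python
import glob

def expand_input(pattern: str) -> list[str]:
    """Expand a quoted glob pattern into a sorted list of shard paths."""
    return sorted(glob.glob(pattern))
```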
Remote or Hugging Face Parquet glob:
```
token-counter \
  --input "hf://datasets/<org>/<dataset>@main/path/*.parquet" \
  --format parquet \
  --field text
```

Hugging Face dataset:
```
token-counter \
  --dataset <org>/<dataset> \
  --split train \
  --field text
```

Hugging Face dataset with config/revision:
```
token-counter \
  --dataset costadev00/gutenberg-project-tokenweaver-cpt-2048 \
  --config default \
  --split train \
  --field text \
  --model Qwen/Qwen3-1.7B-Base \
  --report reports/gutenberg_project_tokenweaver_cpt_2048.md \
  --report-json reports/gutenberg_project_tokenweaver_cpt_2048.json
```

Multiple Hugging Face splits:
```
token-counter \
  --dataset felipeoes/br_legislation \
  --revision refs/convert/parquet \
  --config default \
  --splits valid invalid \
  --field text_markdown \
  --model Qwen/Qwen3-1.7B-Base \
  --report reports/br_legislation.md \
  --report-json reports/br_legislation.json
```

The module form is equivalent:

```
python -m token_counter.cli --dataset <org>/<dataset> --split train --field text
```

By default, token-counter writes:

- `reports/token_count_report.md`
- `reports/token_count_report.json`
- `reports/token_count_report_distribution.png`
Choose output paths:
```
token-counter \
  --dataset <org>/<dataset> \
  --field text \
  --report reports/count.md \
  --report-json reports/count.json
```

Disable Markdown and write only JSON:
```
token-counter \
  --dataset <org>/<dataset> \
  --field text \
  --report "" \
  --report-json reports/count.json
```

Resume from the JSON checkpoint:
```
token-counter \
  --dataset <org>/<dataset> \
  --field text \
  --report-json reports/count.json \
  --resume
```

Generate a PDF:
```
token-counter \
  --dataset <org>/<dataset> \
  --field text \
  --report reports/count.md \
  --report-pdf
```

| Flag | Use |
|---|---|
| `--input` | Local path, URL, local glob, or `hf://` Parquet/JSONL input |
| `--dataset` | Hugging Face dataset id |
| `--format` | `jsonl` or `parquet` for `--input` |
| `--field` | Text column to tokenize; default `text` |
| `--model` | Tokenizer model; default `Qwen/Qwen3-1.7B-Base` |
| `--batch-size` | Documents per tokenizer batch; default 256 |
| `--max-docs` | Stop after N processed documents |
| `--config` | Hugging Face dataset config |
| `--revision` | Hugging Face dataset revision |
| `--split` | One Hugging Face split; repeatable |
| `--splits` | Multiple Hugging Face splits in one flag |
| `--resume` | Resume from the `--report-json` checkpoint |
| `--report` | Markdown report path; `""` disables Markdown |
| `--report-json` | JSON report/checkpoint path; `""` disables JSON |
| `--report-pdf` | Generate a PDF next to the Markdown report |
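`--batch-size` trades memory for throughput by controlling how many documents go through the tokenizer per call. A rough sketch of the batching loop, with `str.split` standing in for the real Hugging Face tokenizer (the tool's internals may differ):

```python
def count_tokens_batched(docs, tokenize, batch_size=256):
    """Sum token counts, tokenizing batch_size documents at a time."""
    total = 0
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        # A real run would make one tokenizer call per batch; here a
        # whitespace split stands in for the tokenizer.
        total += sum(len(tokenize(doc)) for doc in batch)
    return total

count_tokens_batched(["one two three", "four five", "six"], str.split, batch_size=2)  # 6
```

The total is independent of the batch size; only the number of tokenizer calls changes.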
Markdown reports include:
- Summary with `Total tokens`
- Run context
- Optional by-split table
- Distribution snapshot: mean, median, IQR, P95, P99, stddev
- Histogram plot and table
- Data quality metrics
- Performance metrics
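The distribution snapshot fields can be reproduced from the per-document token counts with the standard library alone. A sketch (not the tool's actual code, which may use a different percentile interpolation):

```python
import statistics

def distribution_snapshot(token_counts):
    """Mean, median, IQR, P95, P99, and stddev of per-document token counts."""
    percentiles = statistics.quantiles(token_counts, n=100)  # 1st..99th cut points
    q1, _, q3 = statistics.quantiles(token_counts, n=4)      # quartiles
    return {
        "mean": statistics.mean(token_counts),
        "median": statistics.median(token_counts),
        "iqr": q3 - q1,
        "p95": percentiles[94],
        "p99": percentiles[98],
        "stddev": statistics.stdev(token_counts),
    }
```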
JSON reports include:
- `summary_stats.total_tokens`
- `distribution_stats`
- `data_quality_stats`
- `performance_stats`
- `by_split` for multi-split runs
- `checkpoint_state` for `--resume`
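To pull the headline number out of a finished run programmatically, read the JSON report (the default path from above; the nested field name comes from the report schema):

```python
import json

def read_total_tokens(path="reports/token_count_report.json"):
    """Return summary_stats.total_tokens from a token-counter JSON report."""
    with open(path) as f:
        return json.load(f)["summary_stats"]["total_tokens"]
```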