resume-extract

Fast, local resume extraction using a fine-tuned DistilBERT NER model. Extracts structured data from resume text, PDF, or DOCX via local document parsing + ONNX inference.

Installation

Binary (recommended):

curl -fsSL https://raw.githubusercontent.com/somus/resume-extract/main/scripts/install-release.sh | bash
resume-extract --help

The installer downloads the latest GitHub Release asset into ~/.local/bin. Override INSTALL_DIR, REPO, or VERSION if needed:

curl -fsSL https://raw.githubusercontent.com/somus/resume-extract/main/scripts/install-release.sh | INSTALL_DIR=/usr/local/bin VERSION=v0.1.0 bash

As a library:

bun install

Build from source:

bun run build:bin
./dist/resume-extract --input ./resume.pdf --ats

Notes:

parseResume() is the text-only fast path.
parseResumePdf() and parseResumeDocx() use @kreuzberg/node for local document text extraction.
parseResumePdf(..., { ocr: true }) enables OCR for scanned PDFs (defaults to Tesseract). Supports tesseract, easyocr, and paddleocr backends via { ocr: { backend: "easyocr" } }. OCR is much slower than text parsing.
On first run, the CLI automatically downloads the required oksomu/resume-ner model files into a local cache if they are missing and shows download progress. Pass --model to use a custom directory or --no-download to require a pre-populated model directory.
Library consumers should manage model directories explicitly.

Features

Structured extraction: name, email, phone, location, companies, titles, education, skills
Document input support: parse raw text, PDF, or DOCX
ATS scoring: completeness score with actionable issues list
Seniority inference: from job titles + years of experience
Country detection: from location + phone prefix
Experience years: computed from employment dates
Section-aware chunking: splits resumes longer than 512 tokens at paragraph boundaries
Section detection: rule-based gap-filling for skills, certifications, and languages the model misses
100% local: runs offline via ONNX, no API calls
Fast text parsing: ~15ms per resume after model load
Optional document parsing: PDF via Kreuzberg, including OCR when enabled; DOCX via Kreuzberg
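
The "Experience years" feature above is described only as "computed from employment dates". The following is a minimal sketch of one way such a computation could work, not the library's actual implementation. It assumes dates are "YYYY-MM" strings and that overlapping jobs should not be double-counted; the Job type, toMonthIndex, and experienceYears are hypothetical names introduced for illustration.

```typescript
// Hypothetical sketch: total years of experience from employment date ranges.
// Assumes "YYYY-MM" date strings; the library's own date handling may differ.
interface Job {
  start: string; // e.g. "2015-03"
  end: string | null; // null means the job is ongoing
}

// Convert "YYYY-MM" to a month index so ranges can be compared numerically.
function toMonthIndex(ym: string): number {
  const [year, month] = ym.split("-").map(Number);
  return year * 12 + (month - 1);
}

// Merge overlapping ranges, sum their lengths, and return years (one decimal).
function experienceYears(jobs: Job[], now: string): number {
  const ranges = jobs
    .map((j): [number, number] => [toMonthIndex(j.start), toMonthIndex(j.end ?? now)])
    .filter(([start, end]) => end > start)
    .sort((a, b) => a[0] - b[0]);

  let totalMonths = 0;
  let current: [number, number] | null = null;
  for (const [start, end] of ranges) {
    if (current !== null && start <= current[1]) {
      // Overlaps the range being built: extend it instead of counting twice.
      current[1] = Math.max(current[1], end);
    } else {
      if (current !== null) totalMonths += current[1] - current[0];
      current = [start, end];
    }
  }
  if (current !== null) totalMonths += current[1] - current[0];
  return Math.round((totalMonths / 12) * 10) / 10;
}
```

Merging ranges first means two concurrent roles count once, which is usually what a reader of "years of experience" expects.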

Model

Uses oksomu/resume-ner — a DistilBERT model fine-tuned for resume NER and exported to ONNX for local structured extraction.

Latest model metrics (from model card, noise-augmented, 25 epochs, entity-level exact-match via seqeval):

entity F1: 97.77%
structured micro F1: 97.88%
clean resume F1: 99.18%
noisy resume F1: 69.24% (OCR/scraped text)
quantized ONNX size: 63MB

Entity types:

NAME, EMAIL, PHONE, LOCATION, COMPANY, TITLE, DATE, DEGREE, INSTITUTION, FIELD, SKILL, CERT, LANGUAGE

Model directory should include:

resume_config.json — pre-processing, post-processing, and inference rules
companies.json — company gazetteer for post-processing
city_country_map.json — 317 cities for country inference
tokenizer/config files
onnx/model_quantized.onnx or onnx/model.onnx
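
Since library consumers manage the model directory themselves, a pre-flight check can catch a missing file before inference. This is a minimal sketch using only the file names listed above; findMissingModelFiles is a hypothetical helper, not part of resume-extract, and tokenizer/config files are not checked because their exact names are not listed here.

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// File names taken from the list above.
const REQUIRED_FILES = ["resume_config.json", "companies.json", "city_country_map.json"];
// Either ONNX file satisfies the requirement.
const ONNX_CANDIDATES = ["onnx/model_quantized.onnx", "onnx/model.onnx"];

// Hypothetical helper: returns the missing entries (an empty array means OK).
// `exists` is injectable so the check can be exercised without touching disk.
function findMissingModelFiles(
  modelDir: string,
  exists: (path: string) => boolean = existsSync,
): string[] {
  const missing = REQUIRED_FILES.filter((file) => !exists(join(modelDir, file)));
  const hasOnnx = ONNX_CANDIDATES.some((file) => exists(join(modelDir, file)));
  if (!hasOnnx) missing.push(ONNX_CANDIDATES.join(" or "));
  return missing;
}
```

Calling findMissingModelFiles("/path/to/model") before parseResume() turns a late inference failure into an early, readable error. The CLI's doctor command covers the same ground for CLI users.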

Usage

import {
  computeATSScore,
  parseResume,
  parseResumeDocx,
  parseResumePdf,
} from "resume-extract";

const result = await parseResume(resumeText, "/path/to/model");
const fromPdf = await parseResumePdf("/path/to/resume.pdf", "/path/to/model");
const fromScannedPdf = await parseResumePdf(pdfBytes, "/path/to/model", { ocr: true });
const fromDocx = await parseResumeDocx("/path/to/resume.docx", "/path/to/model");

// result.personal: { name, email, phone, location }
// result.experience: [{ title, company, start_date, end_date }]
// result.education: [{ degree, field, institution }]
// result.skills: ["Python", "AWS", ...]
// result.seniority: "Senior"
// result.country: "India"
// result.experience_years: 10

const ats = computeATSScore(result);
// ats.score: 87
// ats.issues: [{ severity: "medium", message: "..." }]
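
To act on the ATS issues list, it can help to group messages by severity. The sketch below assumes only the { severity, message } shape shown in the comments above; groupIssuesBySeverity is a hypothetical helper, and the sample issues (including the "high" severity value) are made up for illustration.

```typescript
// Shape mirrors the usage comments above; severity values other than
// "medium" are assumptions.
interface AtsIssue {
  severity: string;
  message: string;
}

// Hypothetical helper: bucket issue messages by their severity label.
function groupIssuesBySeverity(issues: AtsIssue[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const { severity, message } of issues) {
    const bucket = groups.get(severity) ?? [];
    bucket.push(message);
    groups.set(severity, bucket);
  }
  return groups;
}

// Made-up sample data standing in for computeATSScore(result).issues.
const sampleIssues: AtsIssue[] = [
  { severity: "medium", message: "No certifications found" },
  { severity: "high", message: "Missing phone number" },
  { severity: "medium", message: "Few skills listed" },
];

groupIssuesBySeverity(sampleIssues).forEach((messages, severity) => {
  console.log(`${severity}: ${messages.length} issue(s)`);
});
```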

CLI

Run directly with Bun:

bun run cli ./resume.pdf --ats
bun run cli --text "Jane Doe..."
bun run cli ./resume.pdf --view json --output result.json
cat ./resume.txt | bun run cli

# Batch mode
bun run cli batch ./resumes/*.pdf --ats
bun run cli batch --input-dir ./resumes --glob '**/*' --output batch.jsonl
bun run cli batch --input-dir ./resumes --output batch.csv --output-format csv
bun run cli batch --input-dir ./resumes --fail-fast

# Explicit model setup and diagnostics
bun run cli setup-model
bun run cli doctor --ocr
bun run cli doctor --fix
bun run cli doctor --json

Common flags:

--model <path>: model directory
--model-repo <repo>: alternate Hugging Face repo for first-run download
--model-revision <rev>: alternate model revision for first-run download
--no-download: disable automatic model download
--input <path>: input file path
--text <text>: inline text input
--format <auto|text|pdf|docx>: override format detection
--ocr: enable PDF OCR (defaults to Tesseract)
--ocr-backend <backend>: OCR backend: tesseract, easyocr, or paddleocr
--ats: include ATS scoring in output
--view <json|pretty>: render machine JSON or human-friendly terminal output
--output <path>: write structured output to a file
--compact: emit minified JSON

Batch-only flags:

batch [inputs...]: process many resumes at once
--input-dir <path>: scan a directory for resumes
--glob <pattern>: file selection pattern for directory scanning
--concurrency <n>: parallel batch workers, defaults to 4
--fail-fast: stop batch processing on the first extraction error
--output-format <json|jsonl|csv>: structured batch output format

Extra commands:

setup-model: download the configured model into the local cache or custom --model path
update-model: pull the latest model from Hugging Face, re-downloading all files
doctor: inspect model readiness, file integrity, writable cache paths, runtime platform, and optional OCR availability
doctor --fix: download/repair the configured model, then report status
doctor --json: emit machine-readable diagnostics

The CLI checks for model updates once per day. If a newer model is available on Hugging Face, a warning is shown on stderr. Run update-model to pull the latest.

Output behavior:

Single resume commands default to pretty view on a TTY and json otherwise.
Batch commands default to pretty summaries on a TTY and structured JSON otherwise.
Use --view json when piping to other tools.
Use --output with batch plus --output-format jsonl for machine-friendly bulk processing.
Use --output-format csv when you want spreadsheet-friendly flat output with summary fields plus numbered experience and education columns.
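
When consuming a batch.jsonl file produced with --output-format jsonl, each non-empty line can be parsed as one JSON value. This is a minimal sketch; parseJsonl is a hypothetical helper, and because the per-record fields are not specified here, records are typed as unknown rather than given a guessed shape.

```typescript
// Parse JSON Lines text: one JSON value per non-empty line.
// Throws with the 1-based line number when a line is not valid JSON.
function parseJsonl(text: string): unknown[] {
  const records: unknown[] = [];
  const lines = text.split(/\r?\n/);
  lines.forEach((line, index) => {
    if (line.trim() === "") return; // skip blank lines, e.g. a trailing newline
    try {
      records.push(JSON.parse(line));
    } catch {
      throw new Error(`Invalid JSON on line ${index + 1}`);
    }
  });
  return records;
}
```

The file contents can be loaded with readFile from node:fs/promises and passed to parseJsonl; narrow each record to a concrete type only after inspecting the actual output.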

Limitations

English resumes only
Max 512 tokens per chunk (section-aware chunking splits at paragraph boundaries for longer resumes)
Image-based/scanned PDFs require OCR before text extraction
Two-column PDF layouts may flatten during text extraction
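
The 512-token chunk limit above is handled by splitting at paragraph boundaries. The sketch below illustrates that general idea only and is not the library's implementation: it approximates tokens by whitespace-separated words (the real tokenizer will usually produce more tokens), does not model section awareness, and approxTokens and chunkByParagraph are hypothetical names.

```typescript
// Rough token estimate: whitespace-separated words. This is an assumption;
// a subword tokenizer typically yields more tokens than words.
function approxTokens(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Greedily pack whole paragraphs into chunks that stay within maxTokens.
// A single paragraph larger than maxTokens becomes its own oversized chunk.
function chunkByParagraph(text: string, maxTokens = 512): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p !== "");

  const chunks: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;
  for (const paragraph of paragraphs) {
    const n = approxTokens(paragraph);
    if (current.length > 0 && currentTokens + n > maxTokens) {
      // Adding this paragraph would exceed the budget: close the chunk.
      chunks.push(current.join("\n\n"));
      current = [];
      currentTokens = 0;
    }
    current.push(paragraph);
    currentTokens += n;
  }
  if (current.length > 0) chunks.push(current.join("\n\n"));
  return chunks;
}
```

Keeping paragraphs whole avoids cutting an entity such as a job title or company name in half at a chunk boundary.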

Development

bun run test        # Run tests
bun run check       # Biome lint + format check
bun run typecheck   # TypeScript type check
bun run format      # Auto-format

License

MIT
