Get the SPDX License ID from license text.
A portable license ID matcher with command line interface and Python API.
Used as a license detection engine for the Pitloom software bill of materials generator.
- Hybrid matching pipeline:
  - Tier 0.5 (Marker detection): Detects `SPDX-License-Identifier` tags and structured markers (name fields, headings). An exact SPDX tag returns immediately with full confidence.
  - Tier 0 (Shortcut): Fast path for short inputs (names, IDs, brief expressions). Includes:
    - Case-insensitive exact ID match.
    - Prose-context disambiguation for bare deprecated IDs (e.g. "GPL-2.0 or later version" → `GPL-2.0-or-later`).
    - Conservative `-only` fallback when no granting context is present.
  - Tier 1 (Recall): Candidate retrieval using SQLite FTS5 trigram index, capped at the first 100 query words for consistent performance. Comment prefixes (`//`, `#`, `*`, `;`) are stripped before querying.
  - Tier 2 (Precision): Adaptive ranking with RapidFuzz. Sliding-window alignment for fragments; coverage-aware scoring to prefer the tightest match. Marker confidence boosts applied only when confidence ≥ 0.85.
  - Tier 3 (Validation): Optional final validation via `tools-java`.
- Deprecated ID normalisation:
  - `GPL-2.0+` → `GPL-2.0-or-later` (SPDX `+` operator, unambiguous).
  - `Apache-2+` → `Apache-2.0+` (abbreviated base canonicalised, `+` retained).
  - Bare deprecated IDs (e.g. `GPL-2.0`) resolved conservatively to `-only` when no surrounding context is available.
- Unix philosophy: Parseable, line-delimited CLI output.
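The deprecated-ID rules listed above can be illustrated with a minimal sketch. This is not licenseid's implementation: the function name `normalise_deprecated`, the small `BARE_DEPRECATED` set, and the "or later" regular expression are all assumptions made for illustration, and the `Apache-2+` abbreviation case is left out.

```python
import re

# Hypothetical, hand-picked subset of bare deprecated IDs; the real tool
# presumably derives this from the SPDX License List.
BARE_DEPRECATED = {"GPL-2.0", "GPL-3.0", "LGPL-2.1", "AGPL-3.0"}

# Assumed wording that signals an "or later" grant in surrounding prose.
OR_LATER_HINT = re.compile(r"\bor\s+(?:any\s+)?later\b", re.IGNORECASE)


def normalise_deprecated(license_id: str, context: str = "") -> str:
    """Map a deprecated GNU-style ID to a current SPDX ID (illustrative)."""
    # An explicit SPDX "+" operator is unambiguous.
    if license_id.endswith("+") and license_id[:-1] in BARE_DEPRECATED:
        return license_id[:-1] + "-or-later"
    if license_id in BARE_DEPRECATED:
        # Bare ID: use prose context when present, else fall back to -only.
        if OR_LATER_HINT.search(context):
            return license_id + "-or-later"
        return license_id + "-only"
    # Anything else is returned unchanged.
    return license_id
```

For example, `normalise_deprecated("GPL-2.0")` falls back to `GPL-2.0-only`, while passing the context string "GPL-2.0 or later version" selects `GPL-2.0-or-later`.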
Install with pipx:
pipx install licenseid
Or using uv:
uv tool install licenseid
Before matching, you need to build the local license index:
licenseid update
Advanced update options:
- `--version <version>`: Download a specific SPDX License List version (e.g., `3.28.0`).
- `--force`: Force update even if the local database is already at the target version.
- `--no-cache`: Bypass the local cache for downloads.
Identify license text from a file, an ID, or a string:
# From a file (smart detection)
licenseid match LICENSE.txt
# From an ID (smart detection)
licenseid match MIT
# From a string (smart detection / piped)
echo "MIT License..." | licenseid match
# Explicit ID lookup (fastest, skips similarity check)
licenseid match --id MIT
Common options:
- `--db <path>`: Use a custom database path (global option). Supports SQLite URIs for in-memory databases (e.g., `file:test?mode=memory&cache=shared`).
- `--id <id>`: Explicitly treat input as an SPDX License ID (bypasses file/text matching).
- `--bold`: Print only the top license ID (no other info).
- `--diff`: Show a word-by-word diff between the input and the best-matching candidate.
- `--json`: Output results in JSON format.
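To show what a shared in-memory SQLite URI like the one mentioned for `--db` means in practice, here is a small sketch using only Python's standard `sqlite3` module. The `demo` table is hypothetical and unrelated to licenseid's actual schema; the point is only that two connections opened with the same URI see the same in-memory database.

```python
import sqlite3

URI = "file:test?mode=memory&cache=shared"

# uri=True makes sqlite3 interpret the string as a URI, not a filename.
writer = sqlite3.connect(URI, uri=True)
reader = sqlite3.connect(URI, uri=True)

# Hypothetical table, for illustration only.
writer.execute("CREATE TABLE demo (license_id TEXT)")
writer.execute("INSERT INTO demo VALUES ('MIT')")
writer.commit()  # make the row visible to the other connection

rows = reader.execute("SELECT license_id FROM demo").fetchall()

reader.close()
writer.close()
```

The database disappears once the last connection to it is closed, which is what makes this form convenient for tests.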
The system uses a composite score (similarity + coverage bonus/penalty + optional popularity weight + marker confidence boost) to prefer the tightest match. For example, it distinguishes a short permissive licence from a superset that shares the same preamble.
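As a rough illustration of how such a composite score could separate two candidates with the same raw similarity, consider the sketch below. The function name and every weight in it are assumptions chosen for the example; only the list of terms and the 0.85 marker-confidence threshold come from this document.

```python
def composite_score(similarity, coverage, popularity=0.0, marker_confidence=0.0):
    """Illustrative composite score; all weights are made-up assumptions."""
    # Coverage above 0.5 earns a bonus, coverage below 0.5 a penalty.
    coverage_term = 0.05 * (coverage - 0.5)
    # Optional popularity weight (assumed to be in the range 0..1).
    popularity_term = 0.01 * popularity
    # Marker boost applies only at or above the 0.85 confidence threshold.
    marker_term = 0.05 if marker_confidence >= 0.85 else 0.0
    return similarity + coverage_term + popularity_term + marker_term


# A short licence and a superset sharing its preamble, equally similar
# to the input but differing in coverage.
short_licence = composite_score(similarity=0.98, coverage=1.0)
superset = composite_score(similarity=0.98, coverage=0.4)
```

With these assumed weights the short licence ranks above the superset, because the coverage term rewards the candidate that the input covers most completely.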
licenseid maintains a local cache of remote data to save bandwidth.
- `licenses.json`: Cached for 45 days.
- `popularity.csv`: Cached for 75 days.
- SPDX data tarballs are versioned and never expire.
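The expiry rules above amount to a per-file time-to-live check. The following sketch shows one way to express that; the helper `is_stale` and the idea of tracking a `fetched_at` timestamp are assumptions for illustration, not a description of how licenseid stores its cache.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Lifetimes taken from the list above; files not listed never expire.
CACHE_TTL = {
    "licenses.json": timedelta(days=45),
    "popularity.csv": timedelta(days=75),
}


def is_stale(filename: str, fetched_at: datetime,
             now: Optional[datetime] = None) -> bool:
    """Return True if a cached file is older than its lifetime (sketch)."""
    now = now or datetime.now(timezone.utc)
    ttl = CACHE_TTL.get(filename)
    if ttl is None:
        # Versioned tarballs (and, in this sketch, unknown files) never expire.
        return False
    return now - fetched_at > ttl
```

Under these rules a `licenses.json` fetched 46 days ago would be refreshed, while a `popularity.csv` of the same age would still be served from the cache.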
To clear the cache manually:
licenseid --clear-cache
Default (Unix-friendly):
LICENSE_ID=Apache-2.0 SIMILARITY=0.9850 COVERAGE=1.0000
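Because the default output is a single line of `KEY=VALUE` pairs, it is straightforward to parse from another program. The helper below is a sketch that assumes the fields are whitespace-separated and that the three keys shown in the example are present; `parse_match_line` is a hypothetical name, not part of the licenseid API.

```python
def parse_match_line(line: str) -> dict:
    """Parse one line of default licenseid output into a dict (sketch)."""
    # Split on whitespace, then split each pair on the first "=" only.
    fields = dict(pair.split("=", 1) for pair in line.split())
    return {
        "license_id": fields["LICENSE_ID"],
        "similarity": float(fields["SIMILARITY"]),
        "coverage": float(fields["COVERAGE"]),
    }


result = parse_match_line(
    "LICENSE_ID=Apache-2.0 SIMILARITY=0.9850 COVERAGE=1.0000"
)
print(result)
# {'license_id': 'Apache-2.0', 'similarity': 0.985, 'coverage': 1.0}
```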
ID only:
licenseid match LICENSE.txt --bold
Example output:
Apache-2.0
JSON:
licenseid match LICENSE.txt --json
Example output:
[
  {
    "license_id": "Apache-2.0",
    "score": 0.985,
    "similarity": 0.985,
    "coverage": 1.0,
    "is_spdx": true,
    "is_osi_approved": true
  }
]
Diff (visual comparison):
licenseid match LICENSE.txt --diff
Example output:
LICENSE_ID=Apache-2.0 SIMILARITY=0.9980 COVERAGE=0.9975
WORD DIFF:
--- DATABASE
+++ INPUT
@@ -1601,8 +1601,4 @@
language
governing
permissions
-and
-limitations
-under
-the
-license
+se
The CLI follows standard Unix exit code conventions, making it suitable for use in scripts and CI/CD pipelines.
| Exit Code | Meaning | Scenarios |
|---|---|---|
| 0 | Success | Confident match found; predicate is TRUE; database updated or already up-to-date. |
| 1 | Logic Failure | No matching license found; predicate is FALSE; network error. |
| 2 | Usage Error | Missing subcommand; missing input text/file; invalid parameters. |
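When calling the CLI from Python, the three exit codes in the table can be handled explicitly. The sketch below uses the `licenseid match <file> --bold` invocation shown earlier; the names `ExitCode`, `interpret`, and `match_file` are hypothetical wrappers, and treating exit code 1 as "no match" is a simplification, since the table notes that it also covers network errors.

```python
import subprocess
from enum import IntEnum
from typing import Optional


class ExitCode(IntEnum):
    SUCCESS = 0
    LOGIC_FAILURE = 1
    USAGE_ERROR = 2


def interpret(returncode: int, stdout: str) -> Optional[str]:
    """Turn an exit code and stdout into a license ID or None (sketch)."""
    if returncode == ExitCode.SUCCESS:
        return stdout.strip()
    if returncode == ExitCode.LOGIC_FAILURE:
        # No confident match (or another logic failure such as a network error).
        return None
    raise RuntimeError(f"licenseid usage error (exit code {returncode})")


def match_file(path: str) -> Optional[str]:
    """Run `licenseid match <path> --bold` and interpret the result."""
    proc = subprocess.run(
        ["licenseid", "match", path, "--bold"],
        capture_output=True,
        text=True,
    )
    return interpret(proc.returncode, proc.stdout)
```

Keeping `interpret` separate from the subprocess call makes the exit-code handling easy to test without the CLI installed.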
Predicate commands are designed for shell scripting. They print true/false and exit with 0 (for true) or 1 (for false).
| Command | Description |
|---|---|
| is-spdx | True if the license is in the SPDX License List. |
| is-open | True if the license is OSI-approved OR FSF-libre. |
| is-free | Alias for is-open. |
| is-osi | True if the license is OSI-approved. |
| is-fsf | True if the license is FSF-libre. |
Example usage in a script:
# Check by ID
if licenseid is-osi MIT; then
    echo "This is an OSI-approved license."
fi
# Check by File
licenseid is-open LICENSE.txt || echo "Warning: Not an open source license"
# Check by Text (via stdin)
echo "MIT License..." | licenseid is-fsf && echo "FSF Libre!"
You can use licenseid directly in your Python projects:
from licenseid.matcher import AggregatedLicenseMatcher
# Initialize with default database
matcher = AggregatedLicenseMatcher()
# 1. Match by Raw Text (Positional or Keyword)
# Programmatic API is explicit: positional 'text' is always treated as text.
results = matcher.match("Permission is hereby granted...")
results = matcher.match(text="Custom license text...")
# 2. Match by SPDX License ID (Explicit)
# This performs a fast database lookup and returns full metadata.
results = matcher.match(license_id="MIT")
# 3. Match by File Path (Explicit)
results = matcher.match(file_path="LICENSE.txt")
# 4. Predicates
# Supports keyword arguments for precise control.
if matcher.is_osi(license_id="MIT"):
    print("OSI Approved!")
if matcher.is_open(file_path="LICENSE.txt"):
    print("Open Source!")
if matcher.is_spdx(text="Creative Commons Zero v1.0 Universal"):
    print("SPDX Match Found!")
Example JSON output:
[
  {
    "license_id": "MIT",
    "score": 1.01,
    "similarity": 1.0,
    "coverage": 0.0
  }
]
Regular test suite:
pytest
Run benchmarks and accuracy tests (expensive):
pytest --run-benchmark
`SPDX_TOOLS_JAR`: Path to the `tools-java` jar for Tier 3 validation.
Apache-2.0