bact/licenseid

LicenseID - A portable SPDX License ID matcher

Get the SPDX License ID from license text.

A portable license ID matcher with command line interface and Python API.

Used as the license detection engine for the Pitloom software bill of materials generator.

Features

  • Hybrid matching pipeline:
    • Tier 0.5 (Marker detection): Detects SPDX-License-Identifier tags and structured markers (name fields, headings). An exact SPDX tag returns immediately with full confidence.
    • Tier 0 (Shortcut): Fast path for short inputs (names, IDs, brief expressions). Includes:
      • Case-insensitive exact ID match.
      • Prose-context disambiguation for bare deprecated IDs (e.g. "GPL-2.0 or later version" → GPL-2.0-or-later).
      • Conservative -only fallback when no granting context is present.
    • Tier 1 (Recall): Candidate retrieval using SQLite FTS5 trigram index, capped at the first 100 query words for consistent performance. Comment prefixes (//, #, *, ;) are stripped before querying.
    • Tier 2 (Precision): Adaptive ranking with RapidFuzz. Sliding-window alignment for fragments; coverage-aware scoring to prefer the tightest match. Marker confidence boosts applied only when confidence ≥ 0.85.
    • Tier 3 (Validation): Optional final validation via tools-java.
  • Deprecated ID normalisation:
    • GPL-2.0+ → GPL-2.0-or-later (SPDX + operator, unambiguous).
    • Apache-2+ → Apache-2.0+ (abbreviated base canonicalised, + retained).
    • Bare deprecated IDs (e.g. GPL-2.0) resolved conservatively to -only when no surrounding context is available.
  • Unix philosophy: Parseable, line-delimited CLI output.
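
The deprecated ID normalisation rules above can be sketched in a few lines of Python. This is an illustration only, not the implementation used by licenseid: the function name normalise_license_id and both lookup tables are hypothetical and deliberately tiny, whereas the real tool works from the full SPDX License List.

```python
# Hypothetical sample of deprecated bare IDs that need an -only / -or-later suffix.
DEPRECATED_BARE_IDS = {"GPL-2.0", "GPL-3.0", "LGPL-2.1", "LGPL-3.0", "AGPL-3.0"}

# Hypothetical abbreviation table: abbreviated base -> canonical base.
ABBREVIATED_BASES = {"Apache-2": "Apache-2.0"}


def normalise_license_id(raw: str) -> str:
    """Apply the three normalisation rules to a single ID with no prose context."""
    text = raw.strip()
    has_plus = text.endswith("+")
    base = text[:-1] if has_plus else text

    # Canonicalise an abbreviated base first (Apache-2 -> Apache-2.0).
    base = ABBREVIATED_BASES.get(base, base)

    if base in DEPRECATED_BARE_IDS:
        # "+" is unambiguous and maps to -or-later; a bare ID falls back to -only.
        return f"{base}-or-later" if has_plus else f"{base}-only"

    # For every other ID the "+" operator is retained as written.
    return f"{base}+" if has_plus else base


for example in ("GPL-2.0+", "Apache-2+", "GPL-2.0", "MIT"):
    print(example, "->", normalise_license_id(example))
```

The sketch ignores surrounding prose, so a bare GPL-2.0 always resolves to GPL-2.0-only here; the prose-context disambiguation described under Tier 0 is out of its scope.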

Installation

Install with pipx:

pipx install licenseid

Or using uv:

uv tool install licenseid

Usage

1. Update the license database

Before matching, you need to build the local license index:

licenseid update

Advanced update options:

  • --version <version>: Download a specific SPDX License List version (e.g., 3.28.0).
  • --force: Force update even if the local database is already at the target version.
  • --no-cache: Bypass the local cache for downloads.

2. Identify a license

Identify license text from a file, an ID, or a string:

# From a file (smart detection)
licenseid match LICENSE.txt

# From an ID (smart detection)
licenseid match MIT

# From a string (smart detection / piped)
echo "MIT License..." | licenseid match

# Explicit ID lookup (fastest, skips similarity check)
licenseid match --id MIT

Common options:

  • --db <path>: Use a custom database path (global option). Supports SQLite URIs for in-memory databases (e.g., file:test?mode=memory&cache=shared).
  • --id <id>: Explicitly treat input as an SPDX License ID (bypasses file/text matching).
  • --bold: Print only the top license ID (no other info).
  • --diff: Show a word-by-word diff between the input and the best-matching candidate.
  • --json: Output results in JSON format.
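
For readers unfamiliar with SQLite URIs, the following sketch shows how a URI of the form given for --db behaves with Python's standard sqlite3 module. It does not use licenseid at all; the table name demo is made up for the illustration, and it assumes an SQLite build with shared-cache support (the usual default).

```python
import sqlite3

URI = "file:test?mode=memory&cache=shared"

# uri=True makes sqlite3 interpret the string as a URI instead of a file name.
writer = sqlite3.connect(URI, uri=True)
reader = sqlite3.connect(URI, uri=True)

# "demo" is a hypothetical table, unrelated to the licenseid schema.
writer.execute("CREATE TABLE demo (license_id TEXT)")
writer.execute("INSERT INTO demo VALUES ('MIT')")
writer.commit()

# Both connections share the same named in-memory database.
rows = reader.execute("SELECT license_id FROM demo").fetchall()
print(rows)

reader.close()
writer.close()
```

The database exists only while at least one connection to it stays open, which is what makes this form convenient for tests.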

The matcher ranks candidates by a composite score (similarity + coverage bonus/penalty + optional popularity weight + marker confidence boost) so that the tightest match wins. For example, it distinguishes a short permissive license from a superset license that shares the same preamble.
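
To make the idea of a composite score concrete, here is a minimal sketch. The function composite_score, its weights, and the exact shape of each term are assumptions chosen for illustration; only the four ingredients and the 0.85 marker threshold come from the description above, and the real weighting in licenseid may differ.

```python
def composite_score(
    similarity: float,
    coverage: float,
    popularity: float = 0.0,
    marker_confidence: float = 0.0,
    coverage_weight: float = 0.05,    # assumed weight
    popularity_weight: float = 0.01,  # assumed weight
    marker_boost: float = 0.05,       # assumed boost size
    marker_threshold: float = 0.85,   # threshold stated under Tier 2
) -> float:
    """Combine the four ingredients into one ranking score (illustrative only)."""
    score = similarity
    # Coverage term: zero at full coverage, a penalty as coverage drops.
    score += coverage_weight * (coverage - 1.0)
    # Optional popularity weight, with popularity assumed to lie in [0, 1].
    score += popularity_weight * popularity
    # The marker boost applies only to sufficiently confident markers.
    if marker_confidence >= marker_threshold:
        score += marker_boost
    return score


# Same similarity, different coverage: the tighter match ranks higher.
tight = composite_score(similarity=0.95, coverage=1.0)
loose = composite_score(similarity=0.95, coverage=0.4)
print(tight > loose)
```

Under these assumed weights, two candidates with equal similarity are separated by coverage alone, which is the behaviour needed to prefer a short license over a superset that merely contains it.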

3. Cache management

licenseid maintains a local cache of remote data to save bandwidth.

  • licenses.json: Cached for 45 days.
  • popularity.csv: Cached for 75 days.
  • SPDX data tarballs are versioned and never expire.
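
The expiry rules above amount to a simple age check. The sketch below is one way to express them, not the tool's own code: the names CACHE_TTL and is_fresh are hypothetical, and any file not listed in the table is assumed to be a versioned tarball that never expires.

```python
from datetime import datetime, timedelta

# Lifetimes taken from the list above.
CACHE_TTL = {
    "licenses.json": timedelta(days=45),
    "popularity.csv": timedelta(days=75),
}


def is_fresh(name: str, fetched_at: datetime, now: datetime) -> bool:
    """Return True if the cached file may still be used without re-downloading."""
    ttl = CACHE_TTL.get(name)
    if ttl is None:
        # Assumption: unlisted files are versioned tarballs, which never expire.
        return True
    return now - fetched_at < ttl


fetched = datetime(2025, 1, 1)
print(is_fresh("licenses.json", fetched, datetime(2025, 2, 1)))   # 31 days old
print(is_fresh("licenses.json", fetched, datetime(2025, 3, 1)))   # 59 days old
print(is_fresh("popularity.csv", fetched, datetime(2025, 3, 1)))  # 59 days old
```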

To clear the cache manually:

licenseid --clear-cache

4. Output formats

Default (Unix-friendly):

LICENSE_ID=Apache-2.0 SIMILARITY=0.9850 COVERAGE=1.0000
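
Because the default output is a single line of space-separated KEY=VALUE pairs, it is easy to parse from a script. A minimal sketch in Python, assuming exactly the three keys shown in the example line (parse_match_line is a hypothetical helper, not part of licenseid):

```python
def parse_match_line(line: str) -> dict:
    """Parse one default-format result line into typed fields."""
    # Split on whitespace, then split each pair on the first "=" only.
    fields = dict(pair.split("=", 1) for pair in line.split())
    return {
        "license_id": fields["LICENSE_ID"],
        "similarity": float(fields["SIMILARITY"]),
        "coverage": float(fields["COVERAGE"]),
    }


result = parse_match_line("LICENSE_ID=Apache-2.0 SIMILARITY=0.9850 COVERAGE=1.0000")
print(result["license_id"], result["similarity"], result["coverage"])
```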

ID only:

licenseid match LICENSE.txt --bold

Example output:

Apache-2.0

JSON:

licenseid match LICENSE.txt --json

Example output:

[
  {
    "license_id": "Apache-2.0",
    "score": 0.985,
    "similarity": 0.985,
    "coverage": 1.0,
    "is_spdx": true,
    "is_osi_approved": true
  }
]
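
The JSON form is the easier one to consume when more than the ID is needed. The sketch below filters a result list for OSI-approved entries; it reads the example output above from a string rather than invoking the CLI, and osi_approved_ids is a hypothetical helper name.

```python
import json

SAMPLE = """
[
  {
    "license_id": "Apache-2.0",
    "score": 0.985,
    "similarity": 0.985,
    "coverage": 1.0,
    "is_spdx": true,
    "is_osi_approved": true
  }
]
"""


def osi_approved_ids(payload: str) -> list:
    """Return the IDs of all results flagged as OSI-approved."""
    results = json.loads(payload)
    # .get() tolerates results that omit the flag.
    return [r["license_id"] for r in results if r.get("is_osi_approved")]


approved = osi_approved_ids(SAMPLE)
print(approved)
```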

Diff (visual comparison):

licenseid match LICENSE.txt --diff

Example output:

LICENSE_ID=Apache-2.0 SIMILARITY=0.9980 COVERAGE=0.9975

WORD DIFF:
--- DATABASE
+++ INPUT
@@ -1601,8 +1601,4 @@
 language
 governing
 permissions
-and
-limitations
-under
-the
-license
+se
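
A word-level diff in this style can be reproduced with Python's standard difflib by comparing word lists instead of lines. This is a sketch of the general technique, not necessarily how licenseid builds its --diff output; word_diff is a hypothetical helper and any text normalisation the tool applies beforehand is left out.

```python
import difflib


def word_diff(database_text: str, input_text: str, context: int = 3) -> list:
    """Return a unified diff computed over whitespace-separated words."""
    return list(
        difflib.unified_diff(
            database_text.split(),
            input_text.split(),
            fromfile="DATABASE",
            tofile="INPUT",
            n=context,    # number of unchanged words shown around each change
            lineterm="",  # entries are single words, so no trailing newline
        )
    )


reference = "language governing permissions and limitations under the license"
truncated = "language governing permissions se"
diff_lines = word_diff(reference, truncated)
print("\n".join(diff_lines))
```

Word positions in the hunk header are counted from the start of the strings passed in, so a full license text would produce larger offsets, as in the example output above.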

5. Exit codes

The CLI follows standard Unix exit code conventions, making it suitable for use in scripts and CI/CD pipelines.

| Exit Code | Meaning | Scenarios |
|---|---|---|
| 0 | Success | Confident match found; predicate is TRUE; database updated or already up-to-date. |
| 1 | Logic Failure | No matching license found; predicate is FALSE; network error. |
| 2 | Usage Error | Missing subcommand; missing input text/file; invalid parameters. |
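
When calling the CLI from Python rather than from a shell, the exit codes can be handled along these lines. The helper names describe_exit_code and run_predicate are hypothetical; the sketch assumes licenseid is on PATH and uses only the command form `licenseid <predicate> <target>` shown in this README.

```python
import subprocess

# Meanings taken from the exit code table above.
EXIT_MEANINGS = {0: "success", 1: "logic failure", 2: "usage error"}


def describe_exit_code(code: int) -> str:
    """Map an exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, "unknown")


def run_predicate(predicate: str, target: str) -> bool:
    """Run e.g. `licenseid is-osi MIT` and translate the exit code to a bool."""
    proc = subprocess.run(
        ["licenseid", predicate, target],
        capture_output=True,
        text=True,
    )
    if proc.returncode == 0:
        return True
    if proc.returncode == 1:
        # Exit code 1 also covers network errors, so False is not always a
        # definite "no"; inspect proc.stderr when that distinction matters.
        return False
    raise RuntimeError(
        f"licenseid {describe_exit_code(proc.returncode)}: {proc.stderr.strip()}"
    )


print(describe_exit_code(2))
```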

6. License predicates (for CI/CD)

Predicate commands are designed for shell scripting. They print true/false and exit with 0 (for true) or 1 (for false).

| Command | Description |
|---|---|
| is-spdx | True if the license is in the SPDX License List. |
| is-open | True if the license is OSI-approved OR FSF-libre. |
| is-free | Alias for is-open. |
| is-osi | True if the license is OSI-approved. |
| is-fsf | True if the license is FSF-libre. |

Example usage in a script:

# Check by ID
if licenseid is-osi MIT; then
  echo "This is an OSI-approved license."
fi

# Check by File
licenseid is-open LICENSE.txt || echo "Warning: Not an open source license"

# Check by Text (via stdin)
echo "MIT License..." | licenseid is-fsf && echo "FSF Libre!"

Python API

You can use licenseid directly in your Python projects:

from licenseid.matcher import AggregatedLicenseMatcher

# Initialize with default database
matcher = AggregatedLicenseMatcher()

# 1. Match by Raw Text (Positional or Keyword)
# Programmatic API is explicit: positional 'text' is always treated as text.
results = matcher.match("Permission is hereby granted...")
results = matcher.match(text="Custom license text...")

# 2. Match by SPDX License ID (Explicit)
# This performs a fast database lookup and returns full metadata.
results = matcher.match(license_id="MIT")

# 3. Match by File Path (Explicit)
results = matcher.match(file_path="LICENSE.txt")

# 4. Predicates
# Supports keyword arguments for precise control.
if matcher.is_osi(license_id="MIT"):
    print("OSI Approved!")

if matcher.is_open(file_path="LICENSE.txt"):
    print("Open Source!")

if matcher.is_spdx(text="Creative Commons Zero v1.0 Universal"):
    print("SPDX Match Found!")

Example JSON output:

[
  {
    "license_id": "MIT",
    "score": 1.01,
    "similarity": 1.0,
    "coverage": 0.0
  }
]

Development

Running tests

Regular test suite:

pytest

Run benchmarks and accuracy tests (expensive):

pytest --run-benchmark

Configuration

  • SPDX_TOOLS_JAR: Path to the tools-java jar for Tier 3 validation.

License

Apache-2.0
