Please report security issues to nltk.team@gmail.com
NLTK includes a centralized I/O security module (nltk.pathsec) that
validates file paths, network URLs, and zip archives.
As of NLTK 3.10.0, strict enforcement is enabled by default
(ENFORCE=True). In normal operation, NLTK applies the stricter
pathsec policy unless a caller explicitly opts out.
Under enforcement, unauthorized file access, SSRF attempts, and
zip-slip-style path escapes raise exceptions (typically PermissionError) instead of emitting warnings.
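The difference between warning and enforcing can be sketched in a few lines. This is a minimal illustration only: the helper name report_violation and the module-level ENFORCE variable below are hypothetical stand-ins, not the nltk.pathsec API.

```python
import warnings

# Hypothetical switch, named after the ENFORCE flag described above.
ENFORCE = True


def report_violation(message, enforce=None):
    """Raise PermissionError under enforcement, otherwise only warn."""
    if enforce is None:
        enforce = ENFORCE
    if enforce:
        raise PermissionError(message)
    warnings.warn(message, RuntimeWarning, stacklevel=2)


# Application code can treat a policy violation like any other denied access.
try:
    report_violation("access outside allowed data directories")
except PermissionError as exc:
    print(f"blocked: {exc}")
```

Callers that previously relied on warnings may need a try/except around resource loading once enforcement is on.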
NLTK's resource-loading protections are designed to reduce common risks when NLTK is used with untrusted input or in shared environments such as web applications, services, notebooks, CI/CD systems, and multi-tenant pipelines.
In particular, the current policy reduces the risk of:
- Arbitrary local file access through NLTK resource loading by requiring filesystem access to remain within allowed NLTK data directories.
- SSRF to non-public destinations by resolving network targets and blocking loopback, private, link-local, and multicast addresses.
- Redirect-based bypasses by re-validating redirects at each hop.
- Zip-slip attacks by validating extraction targets before writing files.
These protections apply to NLTK's own resource-loading paths and URL handling. They are not a general operating-system sandbox, and they do not prevent all unsafe behavior an application might perform outside NLTK.
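As an illustration of the zip-slip check listed above, the following sketch validates every member's destination before anything is written. It uses only the standard library; the function name safe_extract is an assumption for this example and is not taken from NLTK.

```python
import os
import zipfile


def safe_extract(archive, target_dir):
    """Extract all members, refusing any whose path would leave target_dir."""
    root = os.path.realpath(target_dir)
    with zipfile.ZipFile(archive) as zf:
        # Validate every member first so nothing is written if one is bad.
        for name in zf.namelist():
            dest = os.path.realpath(os.path.join(root, name))
            if os.path.commonpath([root, dest]) != root:
                raise PermissionError(f"zip member escapes target: {name!r}")
        zf.extractall(root)
```

Checking all names before extracting avoids leaving a partially extracted archive behind when a later member is rejected.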
file: URLs are not a general-purpose mechanism for loading arbitrary
local files.
With strict enforcement enabled (ENFORCE=True), file-backed resources
must resolve inside allowed NLTK data directories. By default these
directories are derived from:
- nltk.data.path (configurable at runtime)
- NLTK_DATA environment variable
- Standard locations (~/nltk_data, /usr/share/nltk_data, etc.)
- The system temp directory
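A containment check of this kind can be sketched as follows. The helper names allowed_roots and is_allowed are hypothetical, the default locations are limited to the two named above, and splitting NLTK_DATA on the platform path separator is an assumption of this sketch.

```python
import os
import tempfile
from pathlib import Path


def allowed_roots(extra=()):
    """Collect candidate roots: caller entries, NLTK_DATA, defaults, temp dir."""
    roots = list(extra)
    env = os.environ.get("NLTK_DATA")
    if env:
        # Assumption: NLTK_DATA may hold several entries joined by os.pathsep.
        roots.extend(p for p in env.split(os.pathsep) if p)
    roots += [os.path.expanduser("~/nltk_data"), "/usr/share/nltk_data"]
    roots.append(tempfile.gettempdir())
    return [Path(r).resolve() for r in roots]


def is_allowed(path, roots):
    """True if the fully resolved path lies inside one of the roots."""
    resolved = Path(path).resolve()
    return any(resolved == r or r in resolved.parents for r in roots)
```

Resolving before comparing means that ".." segments and symlinks are collapsed first, so the comparison is made on the location that would actually be opened.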
If you use a custom resource directory, explicitly add it to
nltk.data.path:

    import nltk
    nltk.data.path.append('/my/custom/data')

Then load resources by NLTK resource path rather than relying on access to arbitrary filesystem locations.
Implicit access to the current working directory is not allowed under
strict enforcement (ENFORCE=True) unless that directory has been
explicitly added to nltk.data.path.
If you intentionally want to trust the current directory, authorize it explicitly:

    import nltk
    nltk.data.path.append('.')

This makes the trust decision explicit and avoids surprising behavior in server-side or shared execution environments.
NLTK permits network resource loading only for http: and https:
URLs.
Before a request is made, NLTK validates the resolved destination and blocks requests to:
- loopback addresses
- private RFC1918 ranges
- link-local addresses
- multicast addresses
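The four categories above map directly onto the standard ipaddress module. The sketch below classifies an already-resolved IP literal; the function name blocked_reason is hypothetical, and obfuscated or IPv4-mapped forms are deliberately out of scope here.

```python
import ipaddress

# RFC 1918 private IPv4 ranges.
RFC1918 = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def blocked_reason(address):
    """Return a short reason if the IP literal is blocked, else None."""
    ip = ipaddress.ip_address(address)
    if ip.is_loopback:
        return "loopback"
    if ip.is_link_local:
        return "link-local"
    if ip.is_multicast:
        return "multicast"
    if ip.version == 4 and any(ip in net for net in RFC1918):
        return "private"
    return None
```

Returning a reason rather than a bare boolean makes it easier to produce a useful error message when a request is refused.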
Redirects are re-validated at each hop, so a public URL cannot bypass the policy by redirecting to a blocked destination.
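Per-hop validation can be sketched with urllib's redirect hook. This is an illustration under stated assumptions, not NLTK's code: check_url and ValidatingRedirectHandler are hypothetical names, and ip.is_private is used as a broad stand-in that covers more than the RFC1918 ranges.

```python
import ipaddress
import socket
import urllib.request
from urllib.parse import urlsplit


def check_url(url):
    """Raise PermissionError unless url is http(s) and resolves to public IPs."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise PermissionError(f"scheme or host not allowed: {url}")
    for info in socket.getaddrinfo(parts.hostname, None):
        # info[4][0] is the resolved address; drop any IPv6 zone id.
        ip = ipaddress.ip_address(info[4][0].split("%")[0])
        if ip.is_loopback or ip.is_private or ip.is_link_local or ip.is_multicast:
            raise PermissionError(f"blocked destination {ip} for {url}")


class ValidatingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Re-run the destination check on every redirect hop."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        check_url(newurl)  # raises before the redirected request is built
        return super().redirect_request(req, fp, code, msg, headers, newurl)


# An opener that applies the check to each redirect it follows.
opener = urllib.request.build_opener(ValidatingRedirectHandler)
```

Because this sketch resolves the hostname separately from the connection that follows, it does not by itself defend against a DNS answer that changes between the check and the request.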
In practice, ordinary public URLs continue to work, while destinations
such as 127.0.0.1, any address in 10.0.0.0/8, and 169.254.169.254 (the link-local address commonly used by cloud metadata services) are rejected.
- Path traversal: file access is validated against allowed NLTK data directories (nltk.data.path, NLTK_DATA, and standard system locations).
- SSRF prevention: urlopen resolves hostnames via DNS and blocks requests to loopback, private, link-local, and multicast IP ranges, including obfuscated forms where applicable.
- Zip-slip protection: zip extraction validates that member paths stay within the target directory.
- Pickle safety: nltk.data.load() uses RestrictedUnpickler which blocks all class/function globals. Other pickle loading uses pickle_load() which emits a security warning.
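The idea of blocking all globals during unpickling follows the pattern described in the Python pickle documentation. The sketch below is a generic version of that pattern; NoGlobalsUnpickler and restricted_loads are hypothetical names and are not NLTK's RestrictedUnpickler.

```python
import io
import pickle


class NoGlobalsUnpickler(pickle.Unpickler):
    """Unpickler that refuses to import any class or function."""

    def find_class(self, module, name):
        # Called whenever the pickle stream references a global object.
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")


def restricted_loads(data):
    """Load a pickle that may contain only plain built-in containers and scalars."""
    return NoGlobalsUnpickler(io.BytesIO(data)).load()
```

Plain data such as dicts, lists, strings, and numbers still loads, while any stream that needs to look up a class or function is rejected before that object is imported.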
NLTK's corpus readers perform lexical path containment checks when joining file paths. These checks do not resolve symlinks. If your threat model includes attackers who can place symlinks inside trusted NLTK data directories, keep strict enforcement enabled so paths are fully resolved and validated.
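The gap between a lexical check and a resolved check is easiest to see with a symlink. The sketch below is a standalone demonstration with hypothetical helper names; it assumes a platform where os.symlink is permitted.

```python
import os
import tempfile


def lexically_inside(root, relative):
    """String-level check: normalise '..' but never consult the filesystem."""
    root = os.path.abspath(root)
    joined = os.path.normpath(os.path.join(root, relative))
    return os.path.commonpath([root, joined]) == root


def resolved_inside(root, relative):
    """Filesystem-level check: follow symlinks before comparing."""
    root = os.path.realpath(root)
    joined = os.path.realpath(os.path.join(root, relative))
    return os.path.commonpath([root, joined]) == root


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.join(tmp, "nltk_data")
        outside = os.path.join(tmp, "outside")
        os.mkdir(root)
        os.mkdir(outside)
        # A symlink inside the trusted directory that points out of it.
        os.symlink(outside, os.path.join(root, "link"))
        return (
            lexically_inside(root, "link/secret.txt"),
            resolved_inside(root, "link/secret.txt"),
        )
```

Here the lexical check accepts link/secret.txt because the string stays under the root, while the resolved check rejects it because the real location is outside.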