goodmami · AmitMY · Jul 24, 2025 · Nov 19, 2025 · Dec 16, 2025 · Dec 16, 2025
diff --git a/.github/workflows/publish-docker.yaml b/.github/workflows/publish-docker.yaml
@@ -0,0 +1,60 @@
+# Adapted from https://docs.github.com/en/actions/tutorials/publishing-packages/publishing-docker-images
+name: Publish a Docker image
+
+# Configures this workflow to run every time a new release is created in the repository.
+on:
+  release:
+    types: [ created ]
+
+# Defines two custom environment variables for the workflow. These are used for the Container registry domain, and a name for the Docker image that this workflow builds.
+env:
+  REGISTRY: ghcr.io
+  IMAGE_NAME: ${{ github.repository }}
+
+# There is a single job in this workflow. It's configured to run on the latest available version of Ubuntu.
+jobs:
+  build-and-push-image:
+    runs-on: ubuntu-latest
+    # Sets the permissions granted to the `GITHUB_TOKEN` for the actions in this job.
+    permissions:
+      contents: read
+      packages: write
+      attestations: write
+      id-token: write
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+      # Uses the `docker/login-action` action to log in to the Container registry using the account and password that will publish the packages. Once published, the packages are scoped to the account defined here.
+      - name: Log in to the Container registry
+        uses: docker/login-action@65b78e6e13532edd9afa3aa52ac7964289d1a9c1
+        with:
+          registry: ${{ env.REGISTRY }}
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      # This step uses [docker/metadata-action](https://github.com/docker/metadata-action#about) to extract tags and labels that will be applied to the specified image. The `id` "meta" allows the output of this step to be referenced in a subsequent step. The `images` value provides the base name for the tags and labels.
+      - name: Extract metadata (tags, labels) for Docker
+        id: meta
+        uses: docker/metadata-action@9ec57ed1fcdbf14dcef7dfbe97b2010124a938b7
+        with:
+          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
+      # This step uses the `docker/build-push-action` action to build the image, based on your repository's `Dockerfile`. If the build succeeds, it pushes the image to GitHub Packages.
+      # It uses the `context` parameter to define the build's context as the set of files located in the specified path. For more information, see [Usage](https://github.com/docker/build-push-action#usage) in the README of the `docker/build-push-action` repository.
+      # It uses the `tags` and `labels` parameters to tag and label the image with the output from the "meta" step.
+      - name: Build and push Docker image
+        id: push
+        uses: docker/build-push-action@f2a1d5e99d037542a71f64918e516c093c6f3fc4
+        with:
+          context: .
+          push: true
+          tags: ${{ steps.meta.outputs.tags }}
+          labels: ${{ steps.meta.outputs.labels }}
+
+      # This step generates an artifact attestation for the image, which is an unforgeable statement about where and how it was built. It increases supply chain security for people who consume the image. For more information, see [Using artifact attestations to establish provenance for builds](/actions/security-guides/using-artifact-attestations-to-establish-provenance-for-builds).
+      - name: Generate artifact attestation
+        uses: actions/attest-build-provenance@v2
+        with:
+          subject-name: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
+          subject-digest: ${{ steps.push.outputs.digest }}
+          push-to-registry: true
+
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
@@ -1,6 +1,7 @@
 name: Build and Publish to PyPI or TestPyPI
 
-on: push
+on:
+  workflow_dispatch:
 
 jobs:
   build:

diff --git a/.gitignore b/.gitignore
@@ -71,4 +71,6 @@ dmypy.json
 .vscode/
 
 # benchmarking results
-.benchmarks/
+.benchmarks/
+
+.claude/
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,94 @@
 ## [Unreleased][unreleased]
 
 
+## [v1.3.0]
+
+**Release date: 2026-05-20**
+
+### Added
+
+* `<Synset>` elements in the Wikidata extension XMLs now carry
+  `<Example>` tags, mirrored from their senses. Usage examples now
+  surface through the wn web API on synset endpoints and via
+  `included` in `/words` — e.g. L482 ("you") now returns
+  `"You, in the red shirt: what's your name?"` at
+  `/lexicons/omw-en:1.4/synsets/wikidata-en-L482-S1`.
+* `extensions/wikidata-lexemes/create_extensions.py` falls back to
+  other-language glosses when the lemma language has no gloss
+  (lemma language → English → any available). Eliminates empty
+  `<Definition>` tags across the 130 shipped XMLs.
+
+### Docs
+
+* README documents the `docker build` workaround for hosts that
+  inject IPv6 CIDR entries into `NO_PROXY` (e.g. OrbStack), which
+  breaks `httpx` during the in-image `wn download` step.
+
+### Fixed
+
+* Ruff `UP038` in `wn/_config.py`: replaced `isinstance(x, (str, Path))`
+  with the PEP-604 union form.
+
+
+## [v1.2.1]
+
+**Release date: 2026-05-20**
+
+### Changed
+
+* Expanded `extensions/wikidata-lexemes/_pos_map.py` with ~120 new label
+  mappings so the regenerated extension XMLs use standard POS codes for
+  noun/verb/adjective/adverb/conjunction/adposition/numeral/pronoun/
+  determiner/particle/phrase variants from Wikidata. The `x` bucket now
+  only holds genuinely non-POS labels (roots, plurals, sub-word affixes,
+  onomatopoeia, etc.). 5,195 entries across 66 languages moved from `x`
+  to a typed code.
+
+### Fixed
+
+* Normalize unmapped POS values in the shipped extension XMLs to `x`
+  (`OTHER`) to match `create_extensions.py`'s contract — previously
+  ~150 verbose Wikidata labels were stored as-is and ended up outside
+  `wn.constants.PARTS_OF_SPEECH`. Reported by Cursor Bugbot on PR #1.
+* Lint failures on main: import-order in `wn/web.py` and two over-long
+  schema-hash comments in `wn/_db.py`.
+
+
+## [v1.2.0]
+
+**Release date: 2026-05-20**
+
+Sign-language fork release built on top of upstream v1.1.0.
+
+### Added
+
+* Six function-word parts of speech in `wn.constants.PARTS_OF_SPEECH`:
+  `h` (pronoun), `d` (determiner), `m` (numeral), `i` (interjection),
+  `q` (interrogative), `y` (particle).
+* `extensions/wikidata-lexemes/merge_extension.py`: utility that absorbs
+  a lexicon extension into its base lexicon, leaving one lexicon per
+  language in the database (addresses goodmami/wn#304 for this fork).
+* `extensions/wikidata-lexemes/_pos_map.py`: shared Wikidata-label →
+  short-code map used by the extension generator.
+* Web server now exposes pronunciation data on word forms (from
+  upstream's lexicon-element Pronunciation/Tag model).
+
+### Changed
+
+* Docker image now merges the Wikidata extension XMLs into their
+  base lexicons during build instead of loading them as separate
+  lexicons.
+* `extensions/wikidata-lexemes/create_extensions.py` emits short
+  POS codes via `_pos_map.POS_MAP` and writes `partOfSpeech` on
+  `<Synset>` elements.
+* All 130 generated extension XMLs regenerated to use the new
+  short codes.
+
+### Fixed
+
+* Starlette lifespan handler in `wn.web`.
+
+
 ## [v1.1.0]
 
 **Release date: 2026-03-21**
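
Note on the v1.3.0 gloss fallback (lemma language → English → any available): a minimal sketch of that preference order is below. `pick_gloss` and the `glosses` mapping (language code to gloss text) are hypothetical names for illustration; the actual logic in `create_extensions.py` is not part of this diff and may differ.

```python
from __future__ import annotations


def pick_gloss(glosses: dict[str, str], lemma_lang: str) -> str | None:
    """Return the preferred non-empty gloss, or None if there is none.

    Hypothetical helper. Preference order: the lemma's own language,
    then English, then any other language with a non-empty gloss.
    """
    for lang in (lemma_lang, "en"):
        text = glosses.get(lang, "").strip()
        if text:
            return text
    # Last resort: any remaining gloss. Sorting keeps the choice deterministic.
    for lang in sorted(glosses):
        text = glosses[lang].strip()
        if text:
            return text
    return None


print(pick_gloss({"en": "second-person pronoun", "fr": "pronom"}, "de"))
```

Returning `None` only when every gloss is empty is what would let a generator avoid writing an empty `<Definition>` tag whenever any language has a gloss.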

diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,40 @@
+FROM python:3.11-slim
+
+WORKDIR /app
+
+# Install system dependencies
+RUN apt-get update && apt-get install -y \
+    python3-pip \
+    python3-dev \
+    build-essential \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install web server
+RUN pip install --no-cache-dir uvicorn
+
+COPY wn/ wn/
+COPY pyproject.toml README.md LICENSE ./
+RUN pip install --no-cache-dir ".[web]"
+
+# Download the wordnet data and initialize the database
+# CILI is for the Collaborative Interlingual Index
+# ODENET is for the German WordNet (linked to CILI)
+RUN python -m wn download omw:1.4 cili odenet:1.4
+
+# Load data extensions and merge them into their base lexicons so the
+# Open Multilingual Wordnet ends up with one lexicon per language.
+COPY extensions/wikidata-lexemes/output ./extensions/wikidata-lexemes/output
+COPY extensions/wikidata-lexemes/merge_extension.py ./extensions/wikidata-lexemes/merge_extension.py
+RUN python extensions/wikidata-lexemes/merge_extension.py extensions/wikidata-lexemes/output/*.xml
+
+# Run ANALYZE so SQLite has query planner statistics baked into the image
+RUN python -c "from wn._db import connect; c = connect(); c.execute('ANALYZE')"
+
+# Clean up the downloads directory
+RUN rm -r ~/.wn_data/downloads
+
+# Expose the port
+ENV PORT=8080
+EXPOSE 8080
+
+CMD ["sh", "-c", "uvicorn wn.web:app --host 0.0.0.0 --port $PORT"]
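
Note on the `ANALYZE` step in the Dockerfile: SQLite stores the statistics it gathers in the `sqlite_stat1` table, which the query planner can consult on later connections. The sketch below shows this on an in-memory database. The `forms` table and `form_index` index are made up for illustration and are not the real wn schema.

```python
import sqlite3


def collect_stats() -> list:
    """Create a tiny indexed table, run ANALYZE, and return the stored stats."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE forms (id INTEGER PRIMARY KEY, form TEXT)")
        conn.execute("CREATE INDEX form_index ON forms (form)")
        conn.executemany(
            "INSERT INTO forms (form) VALUES (?)",
            [("you",), ("the",), ("and",)],
        )
        conn.commit()
        # ANALYZE writes one row per analyzed index into sqlite_stat1.
        conn.execute("ANALYZE")
        return conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
    finally:
        conn.close()


for tbl, idx, stat in collect_stats():
    print(tbl, idx, stat)
```

Running it during the image build means the statistics ship inside the database file, so the server does not need to gather them at startup.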
diff --git a/README.md b/README.md
@@ -46,6 +46,24 @@ uv add wn
 > >>> wn.reset_database()  # initialize without re-adding; start from scratch
 > ```
 
+Or using **docker**:
+
+```sh
+docker build -t wn .
+docker run -it -p 8080:8080 wn
+```
+
+> [!NOTE]
+> On hosts that inject IPv6 CIDR entries into `NO_PROXY` (e.g. OrbStack),
+> `httpx` fails to parse the proxy config during `wn download` in the
+> build. Override the proxy build args to work around it:
+> ```sh
+> docker build --network=host \
+>   --build-arg NO_PROXY= --build-arg no_proxy= \
+>   --build-arg HTTP_PROXY= --build-arg HTTPS_PROXY= \
+>   -t wn .
+> ```
+
 ## Getting Started
 
 First, download some data:

diff --git a/docs/api/wn.constants.rst b/docs/api/wn.constants.rst
@@ -254,6 +254,12 @@ Parts of Speech
    - ``p`` -- Adposition
    - ``x`` -- Other
    - ``u`` -- Unknown
+   - ``h`` -- Pronoun
+   - ``d`` -- Determiner
+   - ``m`` -- Numeral
+   - ``i`` -- Interjection
+   - ``q`` -- Interrogative
+   - ``y`` -- Particle
 
 .. autodata:: NOUN
 .. autodata:: VERB
@@ -280,6 +286,35 @@ Parts of Speech
 
 .. autodata:: OTHER
 .. autodata:: UNKNOWN
+.. autodata:: PRONOUN
+.. data:: PRON
+
+   Alias of :py:data:`PRONOUN`
+
+.. autodata:: DETERMINER
+.. data:: DET
+
+   Alias of :py:data:`DETERMINER`
+
+.. autodata:: NUMERAL
+.. data:: NUM
+
+   Alias of :py:data:`NUMERAL`
+
+.. autodata:: INTERJECTION
+.. data:: INTJ
+
+   Alias of :py:data:`INTERJECTION`
+
+.. autodata:: INTERROGATIVE
+.. data:: INTRG
+
+   Alias of :py:data:`INTERROGATIVE`
+
+.. autodata:: PARTICLE
+.. data:: PART
+
+   Alias of :py:data:`PARTICLE`
 
 
 Adjective Positions
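
Note on the new part-of-speech codes: the changelog describes a contract in which any Wikidata label without a mapping is normalized to `x` (Other). A minimal sketch of that contract follows. The `POS_MAP` excerpt, its label keys, and `normalize_pos` are hypothetical; the real table lives in `_pos_map.POS_MAP`, which this diff does not show.

```python
# Hypothetical excerpt: the label keys are assumptions, the short codes
# are the ones documented in wn.constants.
POS_MAP = {
    "pronoun": "h",
    "determiner": "d",
    "numeral": "m",
    "interjection": "i",
    "interrogative": "q",
    "particle": "y",
}
OTHER = "x"


def normalize_pos(label: str) -> str:
    """Map a verbose label to a short code, defaulting to OTHER."""
    return POS_MAP.get(label.strip().lower(), OTHER)


for label in ("Pronoun", "particle", "root"):
    print(label, "->", normalize_pos(label))
```

Defaulting to `x` keeps every stored value inside the documented set of codes, which is the invariant the v1.2.1 fix restores.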

diff --git a/extensions/wikidata-lexemes/.gitignore b/extensions/wikidata-lexemes/.gitignore
@@ -0,0 +1,2 @@
+latest-lexemes.json.bz2
+extras/
diff --git a/extensions/wikidata-lexemes/README.md b/extensions/wikidata-lexemes/README.md
@@ -0,0 +1,73 @@
+# Wikidata Lexemes
+
+Our multilingual wordnet covers nouns, verbs, adjectives, and adverbs well, but lacks function words (prepositions, conjunctions, determiners, pronouns, etc.).
+
+This module creates Global WordNet LMF extension files using Wikidata Lexemes to fill that gap.
+
+## Setup
+
+Install dependencies:
+
+```bash
+pip install ijson requests tqdm
+```
+
+Download the lexemes dump (~400 MB):
+
+```bash
+curl -O https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.json.bz2
+```
+
+## Usage
+
+Run the extension generator:
+
+```bash
+python create_extensions.py
+```
+
+This will:
+1. Filter lexemes to exclude nouns, verbs, adjectives, adverbs, and phrases
+2. Build an interlingual index (ILI) linking senses across languages via English
+3. For lexemes with no Wikidata senses, fall back to the English Wiktionary REST API (filters out reference-only definitions, onomatopoeia, dialectal/archaic terms not covered by omw-en)
+4. Generate XML extension files in `output/` for each language
+
+Set `LANG_FILTER=en` to restrict generation to a single language while iterating.
+
+### Caching
+
+Web requests are cached on disk under `extras/` (gitignored):
+- `extras/wikidata/` — POS/language Q-code metadata
+- `extras/wiktionary/` — Wiktionary REST `definition` responses
+- `extras/wiktionary-cats/` — Wiktionary page categories (action API)
+
+To force a refresh of a cached entry, delete the corresponding file.
+
+## Output
+
+The script generates ~130 language-specific XML files in Global WordNet LMF format:
+- `output/en.xml` - English
+- `output/de.xml` - German
+- `output/ja.xml` - Japanese
+- etc.
+
+Each file contains lexical entries with:
+- Lemma and part of speech (short WN-LMF codes — `h` pronoun, `d` determiner, `m` numeral, `i` interjection, `q` interrogative, `y` particle, plus the existing `n/v/a/r/s/t/c/p/x`)
+- Sense definitions and glosses
+- Usage examples (where available)
+- Sense relations (synonyms, antonyms, hypernyms, hyponyms)
+- Links to English senses that serve as an interlingual index (ILI)
+
+## Loading into the database
+
+`wn.add("file.xml")` stores an extension as a separate lexicon. To get one
+lexicon per language instead, use the bundled merge utility:
+
+```bash
+python merge_extension.py output/*.xml
+```
+
+It calls `wn.add` and then rewrites the database so the extension's
+entries, senses and synsets belong to the base lexicon (e.g. `omw-en:1.4`),
+and deletes the now-empty extension lexicon row. See
+[goodmami/wn#304](https://github.com/goodmami/wn/issues/304) for context.
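
Note on the Caching section of this README: a file-per-entry cache, where deleting a file forces a refresh, could be sketched as follows. `cached_json`, the hashed file names, and the `fetch` callback are assumptions for illustration; the layout the script actually uses under `extras/` may differ.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable


def cached_json(cache_dir: Path, key: str, fetch: Callable[[], Any]) -> Any:
    """Return the cached value for `key`, calling `fetch` only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the key so arbitrary strings (URLs, lemmas) are safe file names.
    name = hashlib.sha256(key.encode("utf-8")).hexdigest() + ".json"
    path = cache_dir / name
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    value = fetch()
    path.write_text(json.dumps(value, ensure_ascii=False), encoding="utf-8")
    return value
```

A caller would pass a closure that performs the web request, for example `cached_json(Path("extras/wiktionary"), lemma, lambda: fetch_definition(lemma))`, where `fetch_definition` is a placeholder for whatever performs the request.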
diff --git a/extensions/wikidata-lexemes/__init__.py b/extensions/wikidata-lexemes/__init__.py
diff --git a/extensions/wikidata-lexemes/_omw_en.py b/extensions/wikidata-lexemes/_omw_en.py
@@ -0,0 +1,17 @@
+"""Cached omw-en lemma → POS coverage."""
+from functools import cache
+
+
+@cache
+def omw_en_pos() -> dict[str, frozenset[str]]:
+    """Return {lemma_lower: frozenset of WN POSes}. Empty if omw-en unavailable."""
+    try:
+        import wn
+        en = wn.Wordnet(lexicon="omw-en")
+    except Exception:
+        return {}
+    by_lemma: dict[str, set[str]] = {}
+    for word in en.words():
+        for form in (word.lemma(), *word.forms()):
+            by_lemma.setdefault(form.lower(), set()).add(word.pos)
+    return {lemma: frozenset(pos) for lemma, pos in by_lemma.items()}
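
Note on `omw_en_pos()`: one plausible use of the returned mapping is to check whether a lemma is already covered by omw-en for a given part of speech. The sketch below uses a hand-written stand-in for the mapping so it runs without a wn database. `is_covered` is a hypothetical helper, not a function from this PR.

```python
from __future__ import annotations


def is_covered(lemma: str, pos: str, coverage: dict[str, frozenset[str]]) -> bool:
    """True if `coverage` lists `pos` for the lowercased lemma."""
    return pos in coverage.get(lemma.lower(), frozenset())


# Stand-in for the result of omw_en_pos(); real data comes from the database.
coverage = {
    "run": frozenset({"n", "v"}),
    "well": frozenset({"n", "r", "a"}),
}

print(is_covered("Run", "v", coverage))
print(is_covered("you", "h", coverage))
```

Because `omw_en_pos()` returns an empty dict when omw-en is unavailable, a check like this would report every lemma as uncovered in that case.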