Closed
29 commits
6914d3b
make wn deployable
AmitMY Jul 24, 2025
c11cede
feat(web): add definitions endpoint and faster forms
AmitMY Nov 19, 2025
062a54c
perf(web): optimize forms query with new database index
AmitMY Dec 16, 2025
6fd3e69
perf(web): add covering index for 55% faster forms query
AmitMY Dec 16, 2025
7cab3cf
feat(web): add with_entities filter to forms endpoint and improve Doc…
Jan 20, 2026
5b46491
Merge upstream/main, preserving web module
Jan 20, 2026
f7a073d
fix(db): add current schema hash after upstream merge
Jan 20, 2026
5ab43c0
feat: add Wikidata lexemes extension for function words
AmitMY Mar 1, 2026
ad0ec53
feat(web): add sense frequency count to word endpoint synsets
AmitMY Mar 1, 2026
4de8374
perf(web): add startup warmup and logging for forms endpoint
AmitMY Mar 1, 2026
9aca580
fix(test): update test_root to expect 200 from index endpoint
AmitMY Mar 1, 2026
b81f0a7
style: fix all ruff lint issues across the project
AmitMY Mar 1, 2026
581a71e
fix: disable auto-publish workflow and fix mypy ignores
AmitMY Mar 1, 2026
ca48e0a
perf(web): lower gzip compression level from 9 to 4
AmitMY Mar 1, 2026
d6aab62
perf: run ANALYZE at Docker build time for faster SQLite queries
AmitMY Mar 1, 2026
aaf058c
Merge remote-tracking branch 'upstream/main'
AmitMY Mar 31, 2026
e8090b8
feat(web): expose pronunciation data on word forms and fix Starlette …
AmitMY Mar 31, 2026
d856879
feat(extensions): merge Wikidata extensions into base lexicons + add …
AmitMY May 20, 2026
4d32f41
Merge pull request #1 from sign/feat/merge-wikidata-extensions
AmitMY May 20, 2026
7dc4e89
chore: bump version to v1.2.0
AmitMY May 20, 2026
e97b0c4
chore: fix lint failures on main
AmitMY May 20, 2026
1dfb6ef
fix(extensions): normalize unmapped POS values to OTHER ("x")
AmitMY May 20, 2026
02ed4ab
feat(extensions): map Wikidata POS variants to standard codes
AmitMY May 20, 2026
3cffbd2
chore: bump version to v1.2.1
AmitMY May 20, 2026
50ff100
feat(extensions): expose Wikidata examples on synsets + multilingual …
AmitMY May 20, 2026
d5dcff9
docs(README): note docker build workaround for IPv6 NO_PROXY
AmitMY May 20, 2026
9ca641c
style(_config): use PEP-604 union in isinstance check
AmitMY May 20, 2026
41d3227
chore: bump version to v1.3.0
AmitMY May 20, 2026
fd417f2
feat(wikidata-lexemes): forms, content-POS gap escape, quality filters
AmitMY May 22, 2026
60 changes: 60 additions & 0 deletions .github/workflows/publish-docker.yaml
@@ -0,0 +1,60 @@
# Adapted from https://docs.github.com/en/actions/tutorials/publishing-packages/publishing-docker-images
name: Publish a Docker image

# Configures this workflow to run every time a new release is created in the repository.
on:
  release:
    types: [ created ]

# Defines two custom environment variables for the workflow. These are used for the Container registry domain, and a name for the Docker image that this workflow builds.
env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

# There is a single job in this workflow. It's configured to run on the latest available version of Ubuntu.
jobs:
  build-and-push-image:
    runs-on: ubuntu-latest
    # Sets the permissions granted to the `GITHUB_TOKEN` for the actions in this job.
    permissions:
      contents: read
      packages: write
      attestations: write
      id-token: write

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      # Uses the `docker/login-action` action to log in to the Container registry using the account and password that will publish the packages. Once published, the packages are scoped to the account defined here.
      - name: Log in to the Container registry
        uses: docker/login-action@65b78e6e13532edd9afa3aa52ac7964289d1a9c1
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      # This step uses [docker/metadata-action](https://github.com/docker/metadata-action#about) to extract tags and labels that will be applied to the specified image. The `id` "meta" allows the output of this step to be referenced in a subsequent step. The `images` value provides the base name for the tags and labels.
      - name: Extract metadata (tags, labels) for Docker
        id: meta
        uses: docker/metadata-action@9ec57ed1fcdbf14dcef7dfbe97b2010124a938b7
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
      # This step uses the `docker/build-push-action` action to build the image, based on your repository's `Dockerfile`. If the build succeeds, it pushes the image to GitHub Packages.
      # It uses the `context` parameter to define the build's context as the set of files located in the specified path. For more information, see [Usage](https://github.com/docker/build-push-action#usage) in the README of the `docker/build-push-action` repository.
      # It uses the `tags` and `labels` parameters to tag and label the image with the output from the "meta" step.
      - name: Build and push Docker image
        id: push
        uses: docker/build-push-action@f2a1d5e99d037542a71f64918e516c093c6f3fc4
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

      # This step generates an artifact attestation for the image, which is an unforgeable statement about where and how it was built. It increases supply chain security for people who consume the image. For more information, see [Using artifact attestations to establish provenance for builds](/actions/security-guides/using-artifact-attestations-to-establish-provenance-for-builds).
      - name: Generate artifact attestation
        uses: actions/attest-build-provenance@v2
        with:
          subject-name: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME}}
          subject-digest: ${{ steps.push.outputs.digest }}
          push-to-registry: true

3 changes: 2 additions & 1 deletion .github/workflows/publish.yml
@@ -1,6 +1,7 @@
 name: Build and Publish to PyPI or TestPyPI

-on: push
+on:
+  workflow_dispatch:

 jobs:
   build:
4 changes: 3 additions & 1 deletion .gitignore
@@ -71,4 +71,6 @@ dmypy.json
 .vscode/

 # benchmarking results
-.benchmarks/
+.benchmarks/
+
+.claude/
88 changes: 88 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,94 @@
## [Unreleased][unreleased]


## [v1.3.0]

**Release date: 2026-05-20**

### Added

* `<Synset>` elements in the Wikidata extension XMLs now carry
`<Example>` tags, mirrored from their senses. Usage examples now
surface through the wn web API on synset endpoints and via
`included` in `/words` — e.g. L482 ("you") now returns
`"You, in the red shirt: what's your name?"` at
`/lexicons/omw-en:1.4/synsets/wikidata-en-L482-S1`.
* `extensions/wikidata-lexemes/create_extensions.py` falls back to
other-language glosses when the lemma language has no gloss
(lemma language → English → any available). Eliminates empty
`<Definition>` tags across the 130 shipped XMLs.

### Docs

* README documents the `docker build` workaround for hosts that
inject IPv6 CIDR entries into `NO_PROXY` (e.g. OrbStack), which
breaks `httpx` during the in-image `wn download` step.

### Fixed

* Ruff `UP038` in `wn/_config.py`: replaced `isinstance(x, (str, Path))`
with the PEP-604 union form.


## [v1.2.1]

**Release date: 2026-05-20**

### Changed

* Expanded `extensions/wikidata-lexemes/_pos_map.py` with ~120 new label
mappings so the regenerated extension XMLs use standard POS codes for
noun/verb/adjective/adverb/conjunction/adposition/numeral/pronoun/
determiner/particle/phrase variants from Wikidata. The `x` bucket now
only holds genuinely non-POS labels (roots, plurals, sub-word affixes,
onomatopoeia, etc.). 5,195 entries across 66 languages moved from `x`
to a typed code.

### Fixed

* Normalize unmapped POS values in the shipped extension XMLs to `x`
(`OTHER`) to match `create_extensions.py`'s contract — previously
~150 verbose Wikidata labels were stored as-is and ended up outside
`wn.constants.PARTS_OF_SPEECH`. Reported by Cursor Bugbot on PR #1.
* Lint failures on main: import-order in `wn/web.py` and two over-long
schema-hash comments in `wn/_db.py`.
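
The contract described in the first item (any label without a mapping falls back to `x`) can be illustrated with a small sketch. The map contents and the function name `normalize_pos` are assumptions for illustration; the real table lives in `_pos_map.POS_MAP` and the real set of valid codes in `wn.constants.PARTS_OF_SPEECH`.

```python
# Illustrative subset only; the shipped map is much larger.
POS_MAP = {"noun": "n", "verb": "v", "pronoun": "h", "determiner": "d"}

# Short codes as listed in the constants docs (assumed complete for this sketch).
VALID_CODES = frozenset("nvarstcpxuhdmiqy")

OTHER = "x"


def normalize_pos(label: str) -> str:
    """Map a Wikidata POS label to a short code, defaulting to OTHER."""
    key = label.strip().lower()
    if key in VALID_CODES:
        # Already a short code: keep it.
        return key
    # Unknown verbose labels never leak through; they become "x".
    return POS_MAP.get(key, OTHER)
```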


## [v1.2.0]

**Release date: 2026-05-20**

Sign-language fork release built on top of upstream v1.1.0.

### Added

* Six function-word parts of speech in `wn.constants.PARTS_OF_SPEECH`:
`h` (pronoun), `d` (determiner), `m` (numeral), `i` (interjection),
`q` (interrogative), `y` (particle).
* `extensions/wikidata-lexemes/merge_extension.py`: utility that absorbs
a lexicon extension into its base lexicon, leaving one lexicon per
language in the database (addresses goodmami/wn#304 for this fork).
* `extensions/wikidata-lexemes/_pos_map.py`: shared Wikidata-label →
short-code map used by the extension generator.
* Web server now exposes pronunciation data on word forms (from
upstream's lexicon-element Pronunciation/Tag model).

### Changed

* Docker image now merges the Wikidata extension XMLs into their
base lexicons during build instead of loading them as separate
lexicons.
* `extensions/wikidata-lexemes/create_extensions.py` emits short
POS codes via `_pos_map.POS_MAP` and writes `partOfSpeech` on
`<Synset>` elements.
* All 130 generated extension XMLs regenerated to use the new
short codes.

### Fixed

* Starlette lifespan handler in `wn.web`.


## [v1.1.0]

**Release date: 2026-03-21**
40 changes: 40 additions & 0 deletions Dockerfile
@@ -0,0 +1,40 @@
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    python3-pip \
    python3-dev \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install web server
RUN pip install uvicorn

COPY wn/ wn/
COPY pyproject.toml README.md LICENSE ./
RUN pip install --no-cache-dir ".[web]"

# Download the wordnet data and initialize the database
# CILI is for the Collaborative Interlingual Index
# ODENET is for the German WordNet (linked to CILI)
RUN python -m wn download omw:1.4 cili odenet:1.4

# Load data extensions and merge them into their base lexicons so the
# Open Multilingual Wordnet ends up with one lexicon per language.
COPY extensions/wikidata-lexemes/output ./extensions/wikidata-lexemes/output
COPY extensions/wikidata-lexemes/merge_extension.py ./extensions/wikidata-lexemes/merge_extension.py
RUN python extensions/wikidata-lexemes/merge_extension.py extensions/wikidata-lexemes/output/*.xml

# Run ANALYZE so SQLite has query planner statistics baked into the image
RUN python -c "from wn._db import connect; c = connect(); c.execute('ANALYZE')"

# Clean up the downloads directory
RUN rm -r ~/.wn_data/downloads

# Expose the port
ENV PORT=8080
EXPOSE 8080

CMD ["sh", "-c", "uvicorn wn.web:app --host 0.0.0.0 --port $PORT"]
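
The `ANALYZE` step above stores query planner statistics in the database file itself, which is why running it at build time bakes them into the image. A minimal sketch with the standard-library `sqlite3` module shows the effect; the table and index names here are made up for the demonstration and are not wn's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forms (id INTEGER PRIMARY KEY, form TEXT)")
conn.execute("CREATE INDEX form_idx ON forms (form)")
conn.executemany("INSERT INTO forms (form) VALUES (?)", [("a",), ("b",), ("b",)])
conn.commit()

# ANALYZE gathers per-index statistics and writes them to sqlite_stat1.
conn.execute("ANALYZE")
stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
```

Each row of `stats` names a table and index together with the statistics the planner consults when choosing between indexes.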
18 changes: 18 additions & 0 deletions README.md
@@ -46,6 +46,24 @@ uv add wn
> >>> wn.reset_database() # initialize without re-adding; start from scratch
> ```

Or using **docker**:

```sh
docker build -t wn .
docker run -it -p 8080:8080 wn
```

> [!NOTE]
> On hosts that inject IPv6 CIDR entries into `NO_PROXY` (e.g. OrbStack),
> `httpx` fails to parse the proxy config during `wn download` in the
> build. Override the proxy build args to work around it:
> ```sh
> docker build --network=host \
> --build-arg NO_PROXY= --build-arg no_proxy= \
> --build-arg HTTP_PROXY= --build-arg HTTPS_PROXY= \
> -t wn .
> ```

## Getting Started

First, download some data:
35 changes: 35 additions & 0 deletions docs/api/wn.constants.rst
@@ -254,6 +254,12 @@ Parts of Speech
- ``p`` -- Adposition
- ``x`` -- Other
- ``u`` -- Unknown
- ``h`` -- Pronoun
- ``d`` -- Determiner
- ``m`` -- Numeral
- ``i`` -- Interjection
- ``q`` -- Interrogative
- ``y`` -- Particle

.. autodata:: NOUN
.. autodata:: VERB
@@ -280,6 +286,35 @@ Parts of Speech

.. autodata:: OTHER
.. autodata:: UNKNOWN
.. autodata:: PRONOUN
.. data:: PRON

   Alias of :py:data:`PRONOUN`

.. autodata:: DETERMINER
.. data:: DET

   Alias of :py:data:`DETERMINER`

.. autodata:: NUMERAL
.. data:: NUM

   Alias of :py:data:`NUMERAL`

.. autodata:: INTERJECTION
.. data:: INTJ

   Alias of :py:data:`INTERJECTION`

.. autodata:: INTERROGATIVE
.. data:: INTRG

   Alias of :py:data:`INTERROGATIVE`

.. autodata:: PARTICLE
.. data:: PART

   Alias of :py:data:`PARTICLE`


Adjective Positions
2 changes: 2 additions & 0 deletions extensions/wikidata-lexemes/.gitignore
@@ -0,0 +1,2 @@
latest-lexemes.json.bz2
extras/
73 changes: 73 additions & 0 deletions extensions/wikidata-lexemes/README.md
@@ -0,0 +1,73 @@
# Wikidata Lexemes

Our multilingual wordnet covers nouns, verbs, adjectives, and adverbs well, but lacks function words (prepositions, conjunctions, determiners, pronouns, etc.).

This module creates Global WordNet LMF extension files using Wikidata Lexemes to fill that gap.

## Setup

Install dependencies:

```bash
pip install ijson requests tqdm
```

Download the lexemes dump (~400MB):

```bash
curl -O https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.json.bz2
```

## Usage

Run the extension generator:

```bash
python create_extensions.py
```

This will:
1. Filter lexemes to exclude nouns, verbs, adjectives, adverbs, and phrases
2. Build an interlingual index (ILI) linking senses across languages via English
3. For lexemes with no Wikidata senses, fall back to the English Wiktionary REST API (filters out reference-only definitions, onomatopoeia, dialectal/archaic terms not covered by omw-en)
4. Generate XML extension files in `output/` for each language

Set `LANG_FILTER=en` to restrict generation to a single language while iterating.

### Caching

Web requests are cached on disk under `extras/` (gitignored):
- `extras/wikidata/` — POS/language Q-code metadata
- `extras/wiktionary/` — Wiktionary REST `definition` responses
- `extras/wiktionary-cats/` — Wiktionary page categories (action API)

To force a refresh of a cached entry, delete the corresponding file.
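
The caching behaviour described here (one file per request, delete the file to refresh) can be sketched as below. This is an assumption-laden illustration rather than the generator's actual code: `cached_json`, the SHA-256 file naming and the `fetch` callback are all hypothetical.

```python
import hashlib
import json
from collections.abc import Callable
from pathlib import Path
from typing import Any


def cached_json(cache_dir: Path, key: str, fetch: Callable[[], Any]) -> Any:
    """Return the cached JSON value for `key`, calling `fetch` only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the key so that URLs and page titles are always safe file names.
    name = hashlib.sha256(key.encode("utf-8")).hexdigest() + ".json"
    path = cache_dir / name
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    value = fetch()
    path.write_text(json.dumps(value, ensure_ascii=False), encoding="utf-8")
    return value
```

With this shape, deleting a file under the cache directory is enough to make the next call hit the network again.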

## Output

The script generates ~130 language-specific XML files in Global WordNet LMF format:
- `output/en.xml` - English
- `output/de.xml` - German
- `output/ja.xml` - Japanese
- etc.

Each file contains lexical entries with:
- Lemma and part of speech (short WN-LMF codes — `h` pronoun, `d` determiner, `m` numeral, `i` interjection, `q` interrogative, `y` particle, plus the existing `n/v/a/r/s/t/c/p/x`)
- Sense definitions and glosses
- Usage examples (where available)
- Sense relations (synonyms, antonyms, hypernyms, hyponyms)
- Interlingual index (ILI)-style links to English senses

## Loading into the database

`wn.add("file.xml")` stores an extension as a separate lexicon. To get one
lexicon per language instead, use the bundled merge utility:

```bash
python merge_extension.py output/*.xml
```

It calls `wn.add` and then rewrites the database so the extension's
entries, senses and synsets belong to the base lexicon (e.g. `omw-en:1.4`),
and deletes the now-empty extension lexicon row. See
[goodmami/wn#304](https://github.com/goodmami/wn/issues/304) for context.
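
The merge described above amounts to re-pointing the extension's rows at the base lexicon and then deleting the extension's lexicon row. The sketch below shows that idea on a deliberately tiny, made-up SQLite schema; the table names, columns, lexicon names and the `merge_lexicon` function are assumptions for illustration and do not reflect wn's real schema or the internals of `merge_extension.py`.

```python
import sqlite3


def merge_lexicon(conn, ext_id, base_id, tables=("entries", "senses", "synsets")):
    """Reassign rows from the extension lexicon to the base, then drop the extension."""
    with conn:  # one transaction: commit on success, roll back on error
        for table in tables:
            # Table names come from the fixed tuple above, never from user input.
            conn.execute(
                f"UPDATE {table} SET lexicon_id = ? WHERE lexicon_id = ?",
                (base_id, ext_id),
            )
        conn.execute("DELETE FROM lexicons WHERE id = ?", (ext_id,))


# Toy database: one base lexicon (id 1) and one extension lexicon (id 2).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lexicons (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE entries (id TEXT, lexicon_id INTEGER);
    CREATE TABLE senses (id TEXT, lexicon_id INTEGER);
    CREATE TABLE synsets (id TEXT, lexicon_id INTEGER);
    INSERT INTO lexicons VALUES (1, 'base'), (2, 'extension');
    INSERT INTO entries VALUES ('e-base', 1), ('e-ext', 2);
    INSERT INTO senses VALUES ('s-ext', 2);
    INSERT INTO synsets VALUES ('ss-ext', 2);
""")
merge_lexicon(conn, ext_id=2, base_id=1)
```

Doing the updates and the delete in a single transaction keeps the database from being left half-merged if any statement fails.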
Empty file.
17 changes: 17 additions & 0 deletions extensions/wikidata-lexemes/_omw_en.py
@@ -0,0 +1,17 @@
"""Cached omw-en lemma → POS coverage."""
from functools import cache


@cache
def omw_en_pos() -> dict[str, frozenset[str]]:
    """Return {lemma_lower: frozenset of WN POSes}. Empty if omw-en unavailable."""
    try:
        import wn
        en = wn.Wordnet(lexicon="omw-en")
    except Exception:
        return {}
    by_lemma: dict[str, set[str]] = {}
    for word in en.words():
        for form in (word.lemma(), *word.forms()):
            by_lemma.setdefault(form.lower(), set()).add(word.pos)
    return {lemma: frozenset(pos) for lemma, pos in by_lemma.items()}
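
The grouping step in `omw_en_pos` can be tried without a wordnet database. The sketch below applies the same `setdefault`/`frozenset` pattern to hard-coded `(forms, pos)` pairs; the sample data and the `group_pos` name are made up for illustration and are not part of this module.

```python
def group_pos(words):
    """Group POS codes by lowercased form, mirroring the loop in omw_en_pos."""
    by_lemma: dict[str, set[str]] = {}
    for forms, pos in words:
        for form in forms:
            by_lemma.setdefault(form.lower(), set()).add(pos)
    # Freeze the sets so the cached result cannot be mutated by callers.
    return {lemma: frozenset(pos) for lemma, pos in by_lemma.items()}


# Stand-in for wordnet entries: a verb with two forms and a homographic noun.
sample = [(("Run", "runs"), "v"), (("run",), "n")]
coverage = group_pos(sample)
```

Here `coverage["run"]` holds both `n` and `v`, while `coverage["runs"]` holds only `v`.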