feat(wikidata-lexemes): forms, content-POS gap escape, quality filters by AmitMY · Pull Request #326 · goodmami/wn

AmitMY · 2026-05-22T15:42:12Z

Summary

- Emit `<Form>` elements from each Wikidata lexeme's `forms[]`, so lookups for inflected forms like `mine`/`should`/`could` resolve to the base lemma's entry. Filter out forms with negation grammaticalFeature `Q1478451` (`shouldn't`, `shan't`) and apostrophe-leading contractions (`'ll`, `'d`).
- English content-POS gap escape: a `SKIP_POS`-classified lemma whose (lemma, POS) isn't already in omw-en (under any morphological lookup) is now kept. Pulls in ~7k missing function/content words.
- Modal-verb override (`P31 = Q560570`): always keep `can`/`will`/`shall` so their forms `could`/`would`/`should` surface. This bypasses the archaic-Wiktionary filter for these lemmas.
- Wiktionary REST fallback for 0-sense lexemes, with quality filters: reference-only definitions, onomatopoeia / dialectal / archaic category filters, capitalised-lemma filter, empty-claims filter.
- Quality filters (dedup + multi-word-capital + empty-claims) moved into `filter_lexemes` so `kept_lang_senses` matches what's actually written. Without this, sense-relation targets dangle and the merge step rejects the XML.
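
The form filter described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in `create_extensions.py`: the helper name `keep_form` is hypothetical, and the dict layout (`representations`, `grammaticalFeatures`) is assumed to follow the Wikidata lexeme JSON.

```python
NEGATION_FEATURE = "Q1478451"  # negation grammaticalFeature named in the PR


def keep_form(form: dict) -> bool:
    """Return True if a lexeme form should be emitted as a <Form> (hypothetical helper)."""
    # Drop negated forms such as "shouldn't" / "shan't".
    if NEGATION_FEATURE in form.get("grammaticalFeatures", []):
        return False
    values = [rep["value"] for rep in form.get("representations", {}).values()]
    # Drop apostrophe-leading contractions such as 'll and 'd.
    return bool(values) and not any(v.startswith(("'", "\u2019")) for v in values)


forms = [
    {"representations": {"en": {"value": "should"}}, "grammaticalFeatures": []},
    {"representations": {"en": {"value": "shouldn't"}}, "grammaticalFeatures": ["Q1478451"]},
    {"representations": {"en": {"value": "'d"}}, "grammaticalFeatures": []},
]
kept = [f["representations"]["en"]["value"] for f in forms if keep_form(f)]
print(kept)  # ['should']
```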

Module refactor

- New `_wikidata.py`: cached `fetch_wikidata_entity` / `get_label` / `get_language_iso` + `safe_filename` + `cached_json_fetch`.
- New `_omw_en.py`: cached `omw_en_pos()` returning `{lemma: frozenset of POSes}`.
- New `_wiktionary.py`: REST + Action API fetchers with per-thread `requests.Session`, batched/paginated category fetches, HTML entity decoding, archaic-or-sound filter scoped to English.
- `_pos_map.py`: derived `CONTENT_POS_MAP` (Wikidata POS label → WN POS code) for the gap-escape check.
- `create_extensions.py`: `xml.sax.saxutils.escape` replaces hand-rolled XML escaping; `NormalizedSense` namedtuple replaces dict-shape smuggling; caches now under `extras/{wikidata,wiktionary,wiktionary-cats}/`.
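
To illustrate the last bullet, here is a small sketch of a typed sense record combined with `xml.sax.saxutils.escape`. The field names of `NormalizedSense` and the `definition_xml` helper are assumptions made for the example; the real tuple in `create_extensions.py` may carry different fields.

```python
from typing import NamedTuple
from xml.sax.saxutils import escape


class NormalizedSense(NamedTuple):
    # Field names are illustrative, not taken from the PR.
    definition: str
    examples: tuple[str, ...] = ()


def definition_xml(sense: NormalizedSense) -> str:
    """Render a <Definition> element; escape() handles &, < and >."""
    return f"<Definition>{escape(sense.definition)}</Definition>"


sense = NormalizedSense("used with <verb> & 'to'")
print(definition_xml(sense))
# <Definition>used with &lt;verb&gt; &amp; 'to'</Definition>
```

A named tuple gives every consumer the same attribute names, which is the benefit the bullet alludes to when it contrasts it with passing loosely shaped dicts around.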

Coverage

Verified end-to-end via the Docker image. EN extension: ~7,900 entries, 130 language files. Top-1000 most-common English words covered: omw-en alone 94.1% → with this extension 99.2%. Remaining 8 (microsoft, sony, ebay, et, rss, st, jul, jun) are brand names / abbreviations / Latin loanwords — defensible exclusions.

Test plan

- `docker build --network=host --build-arg NO_PROXY= --build-arg no_proxy= --build-arg HTTP_PROXY= --build-arg HTTPS_PROXY= -t wn-test .` succeeds (merge step loads all 130 XMLs without dangling-relation errors).
- `docker run --rm wn-test python -c "import wn, wn.morphy; en = wn.Wordnet(lexicon='omw-en', lemmatizer=wn.morphy.Morphy()); print([(w.lemma(), w.pos, w.synsets()[0].definition()) for w in en.words('mine')])"` returns the `my` pronoun entry with the definition "(first-person singular possessive) belonging to me".
- The same lookup for `should` returns `shall` (`v`, "Used before a verb to indicate the simple future tense …").
- `python create_extensions.py` regenerates from the dump deterministically (no network calls if the Wiktionary cache is warm).

🤖 Generated with Claude Code

Add form_lexicon_index to improve forms query performance by enabling direct join from forms to lexicons. Update query to use SQL DISTINCT and indexed columns for faster results. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Add form_lexicon_form_covering_index on (lexicon_rowid, form) which eliminates temporary B-tree creation for DISTINCT operations and allows the query to be satisfied entirely from the index. Reduces forms query time by ~55% (3s → ~1.3s estimated in production). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
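
The effect of a covering index on a `DISTINCT` query can be checked with `EXPLAIN QUERY PLAN`. The sketch below uses an in-memory database and a simplified, hypothetical `forms` table; only the index name and its `(lexicon_rowid, form)` columns come from the commit message.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified stand-in for the real forms table (assumption).
    CREATE TABLE forms (rowid INTEGER PRIMARY KEY, lexicon_rowid INTEGER, form TEXT);
    CREATE INDEX form_lexicon_form_covering_index ON forms (lexicon_rowid, form);
    """
)
conn.executemany(
    "INSERT INTO forms (lexicon_rowid, form) VALUES (?, ?)",
    [(1, "mine"), (1, "mine"), (1, "should"), (2, "Haus")],
)

query = "SELECT DISTINCT form FROM forms WHERE lexicon_rowid = ?"
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (1,)).fetchall()
plan_text = " | ".join(row[3] for row in plan)  # column 3 holds the plan detail
print(plan_text)

result = [row[0] for row in conn.execute(query, (1,))]
print(result)
```

When the plan mentions `COVERING INDEX`, SQLite answers the query from the index alone without touching the table rows.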

…ker config

- Add with_entities query param to forms endpoint (default true), filtering out forms with uppercase letters at SQL level when false
- Use cached_response decorator and lru_cache for forms endpoint
- Optimize Dockerfile: copy only needed files, use PORT env variable
- Add German WordNet (odenet:1.4) to Docker image

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
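
One way to express the `with_entities` filter at the SQL level, combined with `lru_cache`, is sketched below. The function name `list_forms`, the table layout, and the `form = lower(form)` predicate are assumptions for illustration; the commit only states that forms with uppercase letters are filtered in SQL. Note that SQLite's built-in `lower()` only folds ASCII letters.

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forms (form TEXT)")  # simplified, hypothetical table
conn.executemany(
    "INSERT INTO forms VALUES (?)", [("mine",), ("Microsoft",), ("should",), ("mine",)]
)


@lru_cache(maxsize=8)
def list_forms(with_entities: bool = True) -> tuple[str, ...]:
    """Return distinct forms; skip forms containing uppercase ASCII when with_entities is False."""
    sql = "SELECT DISTINCT form FROM forms"
    if not with_entities:
        sql += " WHERE form = lower(form)"
    sql += " ORDER BY form"
    return tuple(row[0] for row in conn.execute(sql))


print(list_forms(with_entities=False))  # ('mine', 'should')
print(list_forms())                     # ('Microsoft', 'mine', 'should')
```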

Merged 47 commits from upstream including:

- Python 3.10+ requirement
- New wn.ili module
- Sense.synset_relations() method
- Code formatting and linting improvements

Kept our web module (wn/web.py, tests/web_test.py, pyproject.toml web extra) that upstream removed, and merged schema hashes for compatibility.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The merged upstream schema has changes (modified column in lexicons, new tables for ILI) that resulted in a different schema hash. Added the new hash to COMPATIBLE_SCHEMA_HASHES. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Adds multilingual WordNet extensions generated from Wikidata Lexemes, covering function words (prepositions, conjunctions, determiners, pronouns, etc.) across ~130 languages. Updates Dockerfile to load extension XML files into the database at build time and fixes CMD to use JSON array format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Include SemCor corpus frequency counts in the included synsets of the words endpoint, enabling clients to rank word senses by usage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Pre-warm DB connection and forms query cache during startup so first user request doesn't pay the cold-query cost. Add debug logging to forms endpoint to identify server-side bottlenecks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The root endpoint now returns a JSON listing of available endpoints, so the test should expect 200 instead of 404. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Sort imports (I001) in web.py, web_test.py
- Replace Optional/Union with modern X | None syntax (UP045, UP007)
- Replace dict() calls with dict literals (C408)
- Fix line length violations (E501) in web.py, _db.py, create_extensions.py
- Combine nested if statements (SIM102) in create_extensions.py
- Remove unnecessary "r" mode argument (UP015) in create_extensions.py
- Extract helpers to reduce function complexity (C901) in create_extensions.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Change publish workflow trigger from push to workflow_dispatch
- Add type: ignore comments to starlette middleware imports

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Level 4 achieves nearly the same compression ratio as level 9 (29% vs 28%) at 10x the speed (23ms vs 268ms for 152K forms). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
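
The trade-off can be reproduced locally with the standard library. The payload below is synthetic (152,000 made-up strings), so the ratios and timings it prints will differ from the figures quoted in the commit.

```python
import gzip
import json
import time

# Synthetic stand-in for the forms response; real data compresses differently.
payload = json.dumps([f"form-{i}" for i in range(152_000)]).encode("utf-8")

sizes = {}
for level in (4, 9):
    start = time.perf_counter()
    compressed = gzip.compress(payload, compresslevel=level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    sizes[level] = len(compressed)
    # Both levels must round-trip to the same bytes.
    assert gzip.decompress(compressed) == payload
    print(f"level {level}: {len(compressed) / len(payload):.1%} of original, {elapsed_ms:.1f} ms")
```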

Bakes query planner statistics into the database image so SQLite can make better index choices on cold start. Also adds the post-ANALYZE schema hash to the compatibility list. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
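
As a rough illustration of what "baking statistics" means, running `ANALYZE` stores per-index statistics in the `sqlite_stat1` table, which ships with the database file. The sketch uses an in-memory database and a hypothetical `forms` table; in the image build the same statement would presumably run against the on-disk database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical minimal schema for demonstration.
    CREATE TABLE forms (lexicon_rowid INTEGER, form TEXT);
    CREATE INDEX form_lexicon_index ON forms (lexicon_rowid);
    """
)
conn.executemany(
    "INSERT INTO forms VALUES (?, ?)", [(i % 3, f"w{i}") for i in range(300)]
)

# ANALYZE gathers statistics the query planner consults when choosing indexes.
conn.execute("ANALYZE")
stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
for tbl, idx, stat in stats:
    print(tbl, idx, stat)
```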

* upstream/main: (26 commits)
  Bump version to v1.1.0
  Support the WN_DATA_DIR environment variable
  Mypy fix
  Reset lexconf on translated synsets
  Fix mypy and ruff fmt issues
  Add method and command to view cache entries
  Improve type modeling of wn._config
  Use a ResourceType enum instead of constant strs
  Make schema hash resilient to SQLite maintenance operations
  Fix command syntax in CLI documentation
  Cut release in CHANGELOG.md; update email address
  Bump to v1.0.0
  Add OMW 2.0 to the index
  Bump to version v1.0.0rc0
  Remove Conda instructions; add warning about db
  Add wn.reset_database() function
  Add OEWN 2025 and 2025+ to the index
  Add support for the OEWN-2025 escaping scheme
  Some typing and code quality fixes
  Support exporting lexicon extensions
  ...

# Conflicts:
#	wn/_db.py

…lifespan

- Enrich word form responses with structured objects including pronunciation data
- Migrate deprecated on_startup to lifespan context manager for Starlette compat
- Add new schema hash after upstream merge (Pronunciation/Tag as lexicon elements)
- Update en.xml wikidata lexemes extension data

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…function-word POS

Previously the Dockerfile loaded each `extensions/wikidata-lexemes/output/*.xml` as a separate lexicon, so queries against e.g. `omw-en:1.4` couldn't see the new function-word entries. This change introduces a merge utility that calls `wn.add` and then rewrites `lexicon_rowid` across all content tables so the extension's data lives under the base lexicon, leaving one lexicon per language (see goodmami#304).

Also adds six new POS codes (h pronoun, d determiner, m numeral, i interjection, q interrogative, y particle) to `wn.constants.PARTS_OF_SPEECH` and updates the Wikidata extension generator to emit grouped short codes instead of verbose labels like "personal pronoun" or "definite article".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
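
The `lexicon_rowid` rewrite can be pictured with a toy schema. Everything below is a sketch under assumptions: the table list, the `merge_into_base` helper, and the lexicon ids are hypothetical, and the real utility operates on the full wn database after `wn.add` has loaded the extension.

```python
import sqlite3

CONTENT_TABLES = ("entries", "synsets")  # hypothetical subset of the content tables


def merge_into_base(conn: sqlite3.Connection, extension_id: str, base_id: str) -> None:
    """Re-point an extension's rows at the base lexicon, then drop the extension lexicon."""

    def rowid_of(lexicon_id: str) -> int:
        row = conn.execute(
            "SELECT rowid FROM lexicons WHERE id = ?", (lexicon_id,)
        ).fetchone()
        if row is None:
            raise LookupError(f"unknown lexicon: {lexicon_id}")
        return row[0]

    ext, base = rowid_of(extension_id), rowid_of(base_id)
    with conn:  # commit on success, roll back on error
        for table in CONTENT_TABLES:
            conn.execute(
                f"UPDATE {table} SET lexicon_rowid = ? WHERE lexicon_rowid = ?",
                (base, ext),
            )
        conn.execute("DELETE FROM lexicons WHERE rowid = ?", (ext,))


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lexicons (rowid INTEGER PRIMARY KEY, id TEXT UNIQUE);
    CREATE TABLE entries (id TEXT, lexicon_rowid INTEGER);
    CREATE TABLE synsets (id TEXT, lexicon_rowid INTEGER);
    INSERT INTO lexicons VALUES (1, 'omw-en'), (2, 'wikidata-en');
    INSERT INTO entries VALUES ('omw-en-dog-n', 1), ('wikidata-en-mine-h', 2);
    INSERT INTO synsets VALUES ('omw-en-0001-n', 1), ('wikidata-en-0001-h', 2);
    """
)
merge_into_base(conn, "wikidata-en", "omw-en")
```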

feat: merge Wikidata extensions into base lexicons + add function-word POS

First sign-specific fork release after upstream v1.1.0. Includes the Wikidata-extension merge utility, six new function-word POS codes, and the pronunciation/web fixes that landed on this fork. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- wn/web.py: move `from contextlib import asynccontextmanager` into the stdlib import group (was a one-off import section by itself, tripping ruff I001). - wn/_db.py: shorten two schema-hash comments that overflowed the 88-col limit after recent additions. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The previous regeneration of extension XMLs left ~150 verbose Wikidata POS labels (`abbreviation`, `abstract noun`, `auxiliary verb`, ...) in place. That contradicts `create_extensions.py`'s contract of `POS_MAP.get(pos_name, OTHER)` and caused 8,193 entries/synsets across 66 languages to be stored with POS values outside `wn.constants.PARTS_OF_SPEECH`. After this change every `partOfSpeech` attribute is one of `{h, d, n, v, a, s, r, m, p, c, q, i, y, t, x, u}`. Anything Wikidata didn't classify into our function-word groupings collapses to `x`, matching what a fresh `create_extensions.py` run would produce. Reported by Cursor Bugbot on PR #1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The previous normalization collapsed everything outside the function-word POS_MAP to OTHER ("x"), but many Wikidata labels are clearly variants of standard WordNet codes:

- noun variants → "n": common/abstract/compound/count/mass/personal/locative/indeclinable/verbal noun, noun class, nominal, plurale tantum, jukugo, names, location
- verb variants → "v": auxiliary/passive/nominalized verb, copula, participle, gerund, transgressive, converb, preverb, verbaliser, past tense, imperative, infinitive
- adjective variants → "a": prenominal/adnominal/demonstrative/possessive/relative/reflexive/indefinite/numeral/pronominal/interrogative adjective, na-adjective, true adjective, predicative, nominal modifier
- adverb variants → "r": postpositive/demonstrative/modal/correlative/pronominal/predicative adverb, intensifier, verbal modifier
- conjunction variants → "c": coordinating/subordinating/concessive conjunction, complementizer, linker
- adposition variants → "p": postposition, ambiposition, localiser
- numeral variants → "m": number, digit, ordinal numeral, classifier (Chinese/Japanese counter words and measure words)
- pronoun variants → "h": subject/object/possessive/resumptive/distributive/determinative pronoun, pro-form
- determiner variants → "d": definite/indefinite determiner, interrogative determinant, article, quantifier, partitive
- particle variants → "y": modal/negative/emphatic/vocative/allocutive/interrogative/verbal/japanese/honorific/exceptive particle, clitics, object/proper-noun markers, 助字
- phrase variants → "t": phraseme, established collocation, fixed expression, locution, infinitive phrase, verbal expression, etc.
- interjection variants → "i": greeting, idiomatic interjection
- interrogative variants → "q": question, question tag

The "x" bucket now only contains genuinely non-POS labels: roots, plurals, sub-word affixes, onomatopoeia, sentences, etc.

Brings 5,195 more entries into typed POS codes (8,193 → 2,998 still "x"), across all 66 affected language XMLs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
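
The `POS_MAP.get(pos_name, OTHER)` contract with a few of the variants listed above looks roughly like this. The dictionary here is a tiny excerpt chosen for illustration, not the full map from `_pos_map.py`, and `normalize_pos` is a hypothetical wrapper name.

```python
OTHER = "x"

# Small excerpt of label → code pairs named in the commit message.
POS_MAP = {
    "abstract noun": "n",
    "auxiliary verb": "v",
    "copula": "v",
    "postposition": "p",
    "ordinal numeral": "m",
    "question tag": "q",
}


def normalize_pos(pos_name: str) -> str:
    """Map a verbose Wikidata POS label to a short code, falling back to OTHER."""
    return POS_MAP.get(pos_name, OTHER)


for label in ("auxiliary verb", "postposition", "onomatopoeia"):
    print(label, "->", normalize_pos(label))
# auxiliary verb -> v
# postposition -> p
# onomatopoeia -> x
```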

Patch release covering the post-v1.2.0 fixes:

- Lint fixes on main
- Normalize unmapped POS values to OTHER (Cursor Bugbot finding)
- Expand POS_MAP with noun/verb/adjective/etc. variants

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…gloss fallback

- create_extensions.py now writes <Example> tags under <Synset> in addition to <Sense>. The wn web API only surfaces examples via synset endpoints (and the synset entries inside /words `included`), so prior output silently hid every Wikidata usage example.
- create_extensions.py now falls back to other-language glosses when the lemma language is missing one (prefer lemma lang → English → any). Eliminates empty <Definition> tags across all 130 XMLs.
- Regenerated all extension output XMLs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Hosts like OrbStack inject IPv6 CIDR entries into NO_PROXY that httpx 0.28 fails to parse, breaking the in-image `wn download` step. Document the `--build-arg NO_PROXY=` workaround. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Resolves ruff UP038. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Minor release covering:

- Wikidata extension examples now visible via the wn web API (mirrored from <Sense> onto <Synset>).
- Multilingual gloss fallback eliminates empty <Definition> tags in the extension output.
- README note for the OrbStack docker build workaround.
- Ruff UP038 cleanup in wn/_config.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Add alternative forms, English content-POS gap coverage, and aggressive quality filtering to the Wikidata Lexemes extension. Top-1000 EN coverage goes from 84% (omw-en alone) to 99.2% with the extension.

Major changes:

- Emit `<Form>` elements from each lexeme's `forms[]`, so lookups for inflected forms like `mine`/`should`/`could` resolve to their base lemma's entry. Filter out forms with negation grammaticalFeature `Q1478451` (`shouldn't`, `shan't`) and apostrophe-leading contractions (`'ll`, `'d`).
- English content-POS gap escape: a SKIP_POS-classified lemma whose (lemma, POS) isn't in omw-en under any morphological lookup is now included. Pulls in ~7k missing function/content words.
- Modal-verb override (`P31 = Q560570`): always keep `can`/`will`/`shall` so their forms `could`/`would`/`should` surface. Bypasses the archaic-Wiktionary filter for these.
- Wiktionary REST fallback for 0-sense lexemes, with quality filters: reference-only definitions, onomatopoeia / dialectal / archaic category filters, capitalised-lemma filter, empty-claims filter.
- Filter dedup + multi-word-capital + empty-claims moved into `filter_lexemes` so `kept_lang_senses` matches what we actually write; otherwise sense-relation targets dangle when the merge step loads the XML.

Module refactor:

- New `_wikidata.py`: cached `fetch_wikidata_entity`/`get_label`/`get_language_iso` + `safe_filename` + `cached_json_fetch`.
- New `_omw_en.py`: cached `omw_en_pos()` returning `{lemma: frozenset of POSes}`.
- New `_wiktionary.py`: REST + Action API fetchers with per-thread `requests.Session`, batched/paginated category fetches, HTML entity decoding, archaic-or-sound filter scoped to English.
- `_pos_map.py`: derived `CONTENT_POS_MAP` (label → WN POS code) for the gap-escape check.
- `create_extensions.py`: `xml.sax.saxutils.escape` replaces hand-rolled XML escaping; `NormalizedSense` namedtuple replaces dict-shape smuggling for Wiktionary fallback senses; in-script caches under `extras/{wikidata,wiktionary,wiktionary-cats}/`.

EN extension: ~7,900 entries, 130 language files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

AmitMY and others added 29 commits July 24, 2025 10:28

make wn deployable

6914d3b

feat(web): add definitions endpoint and faster forms

c11cede

feat(web): add sense frequency count to word endpoint synsets

ad0ec53

fix(test): update test_root to expect 200 from index endpoint

9aca580

fix: disable auto-publish workflow and fix mypy ignores

581a71e

perf(web): lower gzip compression level from 9 to 4

ca48e0a

Merge pull request #1 from sign/feat/merge-wikidata-extensions

4d32f41

chore: bump version to v1.2.0

7dc4e89

chore: bump version to v1.2.1

3cffbd2

style(_config): use PEP-604 union in isinstance check

9ca641c

AmitMY closed this May 22, 2026

AmitMY wants to merge 29 commits into goodmami:main from sign:feat/wikidata-extension-coverage
