feat(wikidata-lexemes): forms, content-POS gap escape, quality filters#326

Closed
AmitMY wants to merge 29 commits into goodmami:main from sign:feat/wikidata-extension-coverage

Conversation

@AmitMY AmitMY commented May 22, 2026

Contributor

Summary

  • Emit <Form> elements from each Wikidata lexeme's forms[], so lookups for inflected forms like mine/should/could resolve to the base lemma's entry. Filter out forms with negation grammaticalFeature Q1478451 (shouldn't, shan't) and apostrophe-leading contractions ('ll, 'd).
  • English content-POS gap escape: a SKIP_POS-classified lemma whose (lemma, POS) isn't already in omw-en (under any morphological lookup) is now kept. Pulls in ~7k missing function/content words.
  • Modal-verb override (P31 = Q560570): always keep can/will/shall so their forms could/would/should surface — bypasses the archaic-Wiktionary filter for these.
  • Wiktionary REST fallback for 0-sense lexemes, with quality filters: reference-only definitions, onomatopoeia / dialectal / archaic category filters, capitalised-lemma filter, empty-claims filter.
  • Quality filters (dedup + multi-word-capital + empty-claims) moved into filter_lexemes so kept_lang_senses matches what's actually written — without this, sense-relation targets dangle and the merge step rejects the XML.
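
The form filter in the first bullet can be sketched as follows. This is a minimal illustration, not the code in `create_extensions.py`: the helper names `keep_form` and `written_forms` are hypothetical, the sample lexeme is hand-written, and only the QID `Q1478451` and the apostrophe rule come from the summary above. The `forms[]`, `representations`, and `grammaticalFeatures` keys follow the Wikidata lexeme JSON layout.

```python
NEGATION_FEATURE = "Q1478451"  # negation grammaticalFeature named in the summary


def keep_form(form: dict, lang: str) -> bool:
    """Return True if this Wikidata form should become a <Form> element."""
    if NEGATION_FEATURE in form.get("grammaticalFeatures", []):
        return False  # e.g. shouldn't, shan't
    rep = form.get("representations", {}).get(lang)
    if not rep or not rep.get("value"):
        return False
    # Drop apostrophe-leading contractions such as 'll and 'd
    # (both the ASCII and the typographic apostrophe are checked).
    return not rep["value"].startswith(("'", "\u2019"))


def written_forms(lexeme: dict, lang: str = "en") -> list[str]:
    """Collect distinct non-lemma forms of a lexeme, in source order."""
    seen = {lexeme["lemmas"][lang]["value"]}
    result = []
    for form in lexeme.get("forms", []):
        if not keep_form(form, lang):
            continue
        value = form["representations"][lang]["value"]
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result


# Hand-written sample shaped like a Wikidata lexeme (illustrative only).
shall = {
    "lemmas": {"en": {"language": "en", "value": "shall"}},
    "forms": [
        {"representations": {"en": {"language": "en", "value": "shall"}},
         "grammaticalFeatures": []},
        {"representations": {"en": {"language": "en", "value": "should"}},
         "grammaticalFeatures": []},
        {"representations": {"en": {"language": "en", "value": "shouldn't"}},
         "grammaticalFeatures": [NEGATION_FEATURE]},
        {"representations": {"en": {"language": "en", "value": "'ll"}},
         "grammaticalFeatures": []},
    ],
}

print(written_forms(shall))
```

With this sample, only `should` survives: the lemma itself is skipped, `shouldn't` is dropped by the negation feature, and `'ll` by the apostrophe rule.
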

Module refactor

  • New _wikidata.py: cached fetch_wikidata_entity / get_label / get_language_iso + safe_filename + cached_json_fetch.
  • New _omw_en.py: cached omw_en_pos() returning {lemma: frozenset of POSes}.
  • New _wiktionary.py: REST + Action API fetchers with per-thread requests.Session, batched/paginated category fetches, HTML entity decoding, archaic-or-sound filter scoped to English.
  • _pos_map.py: derived CONTENT_POS_MAP (Wikidata POS label → WN POS code) for the gap-escape check.
  • create_extensions.py: xml.sax.saxutils.escape replaces hand-rolled XML escaping; NormalizedSense namedtuple replaces dict-shape smuggling; caches now under extras/{wikidata,wiktionary,wiktionary-cats}/.
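
As an illustration of the last bullet, the standard-library helpers `xml.sax.saxutils.escape` and `quoteattr` cover both element text and attribute values. The two builder functions below are hypothetical and only show the escaping; they are not the generator's actual output code.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def definition_element(text: str) -> str:
    """Build a <Definition> element; escape() handles &, < and >."""
    return f"<Definition>{escape(text)}</Definition>"


def form_element(written_form: str) -> str:
    """Build a <Form> element; quoteattr() adds the quotes and escapes them."""
    return f"<Form writtenForm={quoteattr(written_form)} />"


tricky = "say \"hi\" & 'bye' <now>"

# Round-trip through a real XML parser to confirm nothing is lost.
assert ET.fromstring(definition_element(tricky)).text == tricky
assert ET.fromstring(form_element(tricky)).get("writtenForm") == tricky

print(definition_element("A & B <tag>"))
print(form_element("o'clock"))
```

`quoteattr` picks the quote character itself, so the caller must not wrap its result in additional quotes.
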

Coverage

Verified end-to-end via the Docker image. EN extension: ~7,900 entries, 130 language files. Top-1000 most-common English words covered: omw-en alone 94.1% → with this extension 99.2%. Remaining 8 (microsoft, sony, ebay, et, rss, st, jul, jun) are brand names / abbreviations / Latin loanwords — defensible exclusions.

Test plan

  • docker build --network=host --build-arg NO_PROXY= --build-arg no_proxy= --build-arg HTTP_PROXY= --build-arg HTTPS_PROXY= -t wn-test . succeeds (merge step loads all 130 XMLs without dangling-relation errors).
  • docker run --rm wn-test python -c "import wn, wn.morphy; en = wn.Wordnet(lexicon='omw-en', lemmatizer=wn.morphy.Morphy()); print([(w.lemma(), w.pos, w.synsets()[0].definition()) for w in en.words('mine')])" returns the my pronoun entry with definition (first-person singular possessive) belonging to me.
  • Same lookup for should returns shall (v, "Used before a verb to indicate the simple future tense …").
  • python create_extensions.py regenerates from the dump deterministically (no network calls if the Wiktionary cache is warm).

🤖 Generated with Claude Code

AmitMY and others added 29 commits July 24, 2025 10:28
Add form_lexicon_index to improve forms query performance by enabling
direct join from forms to lexicons. Update query to use SQL DISTINCT
and indexed columns for faster results.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add form_lexicon_form_covering_index on (lexicon_rowid, form) which
eliminates temporary B-tree creation for DISTINCT operations and allows
the query to be satisfied entirely from the index. Reduces forms query
time by ~55% (3s → ~1.3s estimated in production).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
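
The effect of a covering index on a `DISTINCT` query can be checked with `EXPLAIN QUERY PLAN`. The sketch below uses an in-memory database and a simplified, hypothetical `forms` table (the real wn schema has more columns); only the index name and its `(lexicon_rowid, form)` column order come from the commit message.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params=()) -> list[str]:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real table (assumption).
conn.execute(
    "CREATE TABLE forms ("
    " id INTEGER PRIMARY KEY,"
    " lexicon_rowid INTEGER NOT NULL,"
    " form TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO forms (lexicon_rowid, form) VALUES (?, ?)",
    [(1, "mine"), (1, "mine"), (1, "should"), (2, "Haus")],
)

QUERY = "SELECT DISTINCT form FROM forms WHERE lexicon_rowid = ?"

plan_before = query_plan(conn, QUERY, (1,))
conn.execute(
    "CREATE INDEX form_lexicon_form_covering_index"
    " ON forms (lexicon_rowid, form)"
)
plan_after = query_plan(conn, QUERY, (1,))
forms = sorted(row[0] for row in conn.execute(QUERY, (1,)))

print("before:", plan_before)
print("after: ", plan_after)
print(forms)
```

The exact plan wording varies between SQLite versions, so the plans are printed rather than hard-coded here; the point is that the second plan reports a covering index on the query's only table.
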
…ker config

- Add with_entities query param to forms endpoint (default true), filtering
  out forms with uppercase letters at SQL level when false
- Use cached_response decorator and lru_cache for forms endpoint
- Optimize Dockerfile: copy only needed files, use PORT env variable
- Add German WordNet (odenet:1.4) to Docker image

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Merged 47 commits from upstream including:
- Python 3.10+ requirement
- New wn.ili module
- Sense.synset_relations() method
- Code formatting and linting improvements

Kept our web module (wn/web.py, tests/web_test.py, pyproject.toml web extra)
that upstream removed, and merged schema hashes for compatibility.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The merged upstream schema has changes (modified column in lexicons,
new tables for ILI) that resulted in a different schema hash. Added
the new hash to COMPATIBLE_SCHEMA_HASHES.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds multilingual WordNet extensions generated from Wikidata Lexemes,
covering function words (prepositions, conjunctions, determiners, pronouns,
etc.) across ~130 languages. Updates Dockerfile to load extension XML files
into the database at build time and fixes CMD to use JSON array format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Include SemCor corpus frequency counts in the included synsets
of the words endpoint, enabling clients to rank word senses by usage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-warm DB connection and forms query cache during startup so
first user request doesn't pay the cold-query cost. Add debug
logging to forms endpoint to identify server-side bottlenecks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The root endpoint now returns a JSON listing of available endpoints,
so the test should expect 200 instead of 404.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Sort imports (I001) in web.py, web_test.py
- Replace Optional/Union with modern X | None syntax (UP045, UP007)
- Replace dict() calls with dict literals (C408)
- Fix line length violations (E501) in web.py, _db.py, create_extensions.py
- Combine nested if statements (SIM102) in create_extensions.py
- Remove unnecessary "r" mode argument (UP015) in create_extensions.py
- Extract helpers to reduce function complexity (C901) in create_extensions.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change publish workflow trigger from push to workflow_dispatch
- Add type: ignore comments to starlette middleware imports

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Level 4 achieves nearly the same compression ratio as level 9
(29% vs 28%) at 10x the speed (23ms vs 268ms for 152K forms).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
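
A comparison of this kind can be reproduced with a small benchmark. The sketch assumes the response is gzip-compressed (the commit message names only the levels) and uses a synthetic payload, so its numbers will differ from the figures quoted above.

```python
import gzip
import json
import time


def measure(payload: bytes, level: int) -> tuple[float, float]:
    """Return (compressed size / original size, elapsed milliseconds)."""
    start = time.perf_counter()
    compressed = gzip.compress(payload, compresslevel=level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return len(compressed) / len(payload), elapsed_ms


# Synthetic stand-in for a forms response (assumption, not real data).
payload = json.dumps([f"form-{i % 5000}-{i}" for i in range(20_000)]).encode()

ratio4, ms4 = measure(payload, 4)
ratio9, ms9 = measure(payload, 9)

print(f"level 4: {ratio4:.1%} of original in {ms4:.1f} ms")
print(f"level 9: {ratio9:.1%} of original in {ms9:.1f} ms")
```
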
Bakes query planner statistics into the database image so SQLite
can make better index choices on cold start. Also adds the
post-ANALYZE schema hash to the compatibility list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* upstream/main: (26 commits)
  Bump version to v1.1.0
  Support the WN_DATA_DIR environment variable
  Mypy fix
  Reset lexconf on translated synsets
  Fix mypy and ruff fmt issues
  Add method and command to view cache entries
  Improve type modeling of wn._config
  Use a ResourceType enum instead of constant strs
  Make schema hash resilient to SQLite maintenance operations
  Fix command syntax in CLI documentation
  Cut release in CHANGELOG.md; update email address
  Bump to v1.0.0
  Add OMW 2.0 to the index
  Bump to version v1.0.0rc0
  Remove Conda instructions; add warning about db
  Add wn.reset_database() function
  Add OEWN 2025 and 2025+ to the index
  Add support for the OEWN-2025 escaping scheme
  Some typing and code quality fixes
  Support exporting lexicon extensions
  ...

# Conflicts:
#	wn/_db.py
…lifespan

- Enrich word form responses with structured objects including pronunciation data
- Migrate deprecated on_startup to lifespan context manager for Starlette compat
- Add new schema hash after upstream merge (Pronunciation/Tag as lexicon elements)
- Update en.xml wikidata lexemes extension data

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
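
The lifespan pattern mentioned above replaces separate startup and shutdown hooks with a single async context manager. The sketch below is stdlib-only so it runs without Starlette installed; `prewarm` and `state` are hypothetical stand-ins for the real warm-up work in `wn/web.py`.

```python
import asyncio
from contextlib import asynccontextmanager

state = {"warm": False}  # hypothetical stand-in for caches / DB connection


def prewarm() -> None:
    """Placeholder for opening the DB and priming the forms cache."""
    state["warm"] = True


@asynccontextmanager
async def lifespan(app):
    # Code before `yield` runs once at startup (formerly on_startup).
    prewarm()
    yield
    # Code after `yield` runs once at shutdown (formerly on_shutdown).
    state["warm"] = False


async def demo() -> tuple[bool, bool]:
    """Drive the context manager by hand, as the ASGI server would."""
    async with lifespan(app=None):
        during = state["warm"]
    return during, state["warm"]


print(asyncio.run(demo()))
```

In a Starlette application the function would be passed to the constructor as `Starlette(routes=..., lifespan=lifespan)` rather than driven manually.
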
…function-word POS

Previously the Dockerfile loaded each `extensions/wikidata-lexemes/output/*.xml`
as a separate lexicon, so queries against e.g. `omw-en:1.4` couldn't see
the new function-word entries. This change introduces a merge utility that
calls `wn.add` and then rewrites `lexicon_rowid` across all content tables
so the extension's data lives under the base lexicon, leaving one lexicon
per language (see goodmami#304).

Also adds six new POS codes (h pronoun, d determiner, m numeral, i interjection,
q interrogative, y particle) to `wn.constants.PARTS_OF_SPEECH` and updates
the Wikidata extension generator to emit grouped short codes instead of
verbose labels like "personal pronoun" or "definite article".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
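
The `lexicon_rowid` rewrite described above can be sketched against a toy schema. Everything here is a simplified assumption: the real merge utility operates on the full wn database after `wn.add`, covers every content table, and may handle the extension's lexicon row differently from the plain `DELETE` used below.

```python
import sqlite3

# Hypothetical subset of content tables carrying a lexicon_rowid column.
CONTENT_TABLES = ("entries", "synsets")


def merge_extension(conn: sqlite3.Connection, base: int, ext: int) -> None:
    """Re-home the extension's rows under the base lexicon."""
    with conn:  # one transaction: all tables move, or none do
        for table in CONTENT_TABLES:
            conn.execute(
                f"UPDATE {table} SET lexicon_rowid = ? WHERE lexicon_rowid = ?",
                (base, ext),
            )
        # Assumption: the now-empty extension lexicon row is removed.
        conn.execute("DELETE FROM lexicons WHERE rowid = ?", (ext,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lexicons (id TEXT NOT NULL)")
conn.execute("CREATE TABLE entries (lexicon_rowid INTEGER, lemma TEXT)")
conn.execute("CREATE TABLE synsets (lexicon_rowid INTEGER, definition TEXT)")
conn.executemany("INSERT INTO lexicons (id) VALUES (?)",
                 [("omw-en",), ("wikidata-en",)])  # rowids 1 and 2
conn.executemany("INSERT INTO entries VALUES (?, ?)",
                 [(1, "dog"), (2, "my"), (2, "shall")])
conn.execute("INSERT INTO synsets VALUES (2, 'belonging to me')")

merge_extension(conn, base=1, ext=2)

owners = {row[0] for row in conn.execute("SELECT lexicon_rowid FROM entries")}
lexicon_count = conn.execute("SELECT COUNT(*) FROM lexicons").fetchone()[0]
print(owners, lexicon_count)
```
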
feat: merge Wikidata extensions into base lexicons + add function-word POS
First sign-specific fork release after upstream v1.1.0.
Includes the Wikidata-extension merge utility, six new
function-word POS codes, and the pronunciation/web fixes
that landed on this fork.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- wn/web.py: move `from contextlib import asynccontextmanager` into the
  stdlib import group (was a one-off import section by itself, tripping
  ruff I001).
- wn/_db.py: shorten two schema-hash comments that overflowed the 88-col
  limit after recent additions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous regeneration of extension XMLs left ~150 verbose Wikidata
POS labels (`abbreviation`, `abstract noun`, `auxiliary verb`, ...) in
place. That contradicts `create_extensions.py`'s contract of
`POS_MAP.get(pos_name, OTHER)` and caused 8,193 entries/synsets across
66 languages to be stored with POS values outside
`wn.constants.PARTS_OF_SPEECH`.

After this change every `partOfSpeech` attribute is one of
`{h, d, n, v, a, s, r, m, p, c, q, i, y, t, x, u}`. Anything Wikidata
didn't classify into our function-word groupings collapses to `x`,
matching what a fresh `create_extensions.py` run would produce.

Reported by Cursor Bugbot on PR #1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous normalization collapsed everything outside the function-word
POS_MAP to OTHER ("x"), but many Wikidata labels are clearly variants of
standard WordNet codes:

- noun variants → "n": common/abstract/compound/count/mass/personal/
  locative/indeclinable/verbal noun, noun class, nominal, plurale tantum,
  jukugo, names, location
- verb variants → "v": auxiliary/passive/nominalized verb, copula,
  participle, gerund, transgressive, converb, preverb, verbaliser,
  past tense, imperative, infinitive
- adjective variants → "a": prenominal/adnominal/demonstrative/
  possessive/relative/reflexive/indefinite/numeral/pronominal/
  interrogative adjective, na-adjective, true adjective, predicative,
  nominal modifier
- adverb variants → "r": postpositive/demonstrative/modal/correlative/
  pronominal/predicative adverb, intensifier, verbal modifier
- conjunction variants → "c": coordinating/subordinating/concessive
  conjunction, complementizer, linker
- adposition variants → "p": postposition, ambiposition, localiser
- numeral variants → "m": number, digit, ordinal numeral, classifier
  (Chinese/Japanese counter words and measure words)
- pronoun variants → "h": subject/object/possessive/resumptive/
  distributive/determinative pronoun, pro-form
- determiner variants → "d": definite/indefinite determiner,
  interrogative determinant, article, quantifier, partitive
- particle variants → "y": modal/negative/emphatic/vocative/allocutive/
  interrogative/verbal/japanese/honorific/exceptive particle, clitics,
  object/proper-noun markers, 助字
- phrase variants → "t": phraseme, established collocation, fixed
  expression, locution, infinitive phrase, verbal expression, etc.
- interjection variants → "i": greeting, idiomatic interjection
- interrogative variants → "q": question, question tag

The "x" bucket now only contains genuinely non-POS labels: roots,
plurals, sub-word affixes, onomatopoeia, sentences, etc.

Brings 5,195 more entries into typed POS codes (8,193 → 2,998 still "x"),
across all 66 affected language XMLs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
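
A small excerpt shows how the expanded mapping behaves. The entries below are taken from the groupings listed in this commit message, but the dictionary is only a sample, not the full `POS_MAP` in `_pos_map.py`; `normalize_pos` is a hypothetical wrapper around the `POS_MAP.get(pos_name, OTHER)` contract.

```python
OTHER = "x"

# Codes listed in the normalization commit above.
VALID_POS = set("hdnvasrmpcqiytxu")

# Sample of the label -> code groupings (not the complete map).
POS_MAP = {
    "abstract noun": "n",
    "auxiliary verb": "v",
    "na-adjective": "a",
    "intensifier": "r",
    "coordinating conjunction": "c",
    "postposition": "p",
    "ordinal numeral": "m",
    "possessive pronoun": "h",
    "article": "d",
    "modal particle": "y",
    "fixed expression": "t",
    "greeting": "i",
    "question tag": "q",
}


def normalize_pos(pos_name: str) -> str:
    """Map a Wikidata POS label to a short code, falling back to OTHER."""
    return POS_MAP.get(pos_name, OTHER)


for label in ("auxiliary verb", "question tag", "root"):
    print(f"{label!r} -> {normalize_pos(label)!r}")
```

Labels that are not parts of speech, such as `root`, fall through to `x`.
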
Patch release covering the post-v1.2.0 fixes:
- Lint fixes on main
- Normalize unmapped POS values to OTHER (Cursor Bugbot finding)
- Expand POS_MAP with noun/verb/adjective/etc. variants

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…gloss fallback

- create_extensions.py now writes <Example> tags under <Synset> in
  addition to <Sense>. The wn web API only surfaces examples via
  synset endpoints (and the synset entries inside /words `included`),
  so prior output silently hid every Wikidata usage example.
- create_extensions.py now falls back to other-language glosses when
  the lemma language is missing one (prefer lemma lang → English →
  any). Eliminates empty <Definition> tags across all 130 XMLs.
- Regenerated all extension output XMLs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
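
The fallback order (lemma language, then English, then any language) can be sketched as below. `pick_gloss` is a hypothetical helper, and the input mirrors the `glosses` mapping of a Wikidata sense, where each language code maps to an object with a `value` key.

```python
from __future__ import annotations


def pick_gloss(glosses: dict, lemma_lang: str) -> str | None:
    """Choose a definition: lemma language first, then English, then any."""
    for lang in (lemma_lang, "en"):
        gloss = glosses.get(lang)
        if gloss and gloss.get("value"):
            return gloss["value"]
    # Last resort: the first non-empty gloss in any language.
    for gloss in glosses.values():
        if gloss.get("value"):
            return gloss["value"]
    return None  # the caller decides what to do without any gloss


german = {"language": "de", "value": "mir gehörend"}
english = {"language": "en", "value": "belonging to me"}
french = {"language": "fr", "value": "qui est à moi"}

print(pick_gloss({"de": german, "en": english}, "de"))
print(pick_gloss({"en": english, "fr": french}, "de"))
print(pick_gloss({"fr": french}, "de"))
```
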
Hosts like OrbStack inject IPv6 CIDR entries into NO_PROXY that
httpx 0.28 fails to parse, breaking the in-image `wn download` step.
Document the `--build-arg NO_PROXY=` workaround.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Resolves ruff UP038.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Minor release covering:
- Wikidata extension examples now visible via the wn web API
  (mirrored from <Sense> onto <Synset>).
- Multilingual gloss fallback eliminates empty <Definition> tags
  in the extension output.
- README note for the OrbStack docker build workaround.
- Ruff UP038 cleanup in wn/_config.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add alternative forms, English content-POS gap coverage, and aggressive
quality filtering to the Wikidata Lexemes extension. Top-1000 EN coverage
goes from 84% (omw-en alone) to 99.2% with the extension.

Major changes:
- Emit `<Form>` elements from each lexeme's `forms[]`, so lookups for
  inflected forms like `mine`/`should`/`could` resolve to their base
  lemma's entry. Filter out forms with negation grammaticalFeature
  `Q1478451` (`shouldn't`, `shan't`) and apostrophe-leading contractions
  (`'ll`, `'d`).
- English content-POS gap escape: a SKIP_POS-classified lemma whose
  (lemma, POS) isn't in omw-en under any morphological lookup is now
  included. Pulls in ~7k missing function/content words.
- Modal-verb override (`P31 = Q560570`): always keep `can`/`will`/`shall`
  so their forms `could`/`would`/`should` surface. Bypasses the archaic-
  Wiktionary filter for these.
- Wiktionary REST fallback for 0-sense lexemes, with quality filters:
  reference-only definitions, onomatopoeia / dialectal / archaic
  category filters, capitalised-lemma filter, empty-claims filter.
- Filter dedup + multi-word-capital + empty-claims moved into
  `filter_lexemes` so `kept_lang_senses` matches what we actually write
  — otherwise sense-relation targets dangle when the merge step loads
  the XML.

Module refactor:
- New `_wikidata.py`: cached `fetch_wikidata_entity`/`get_label`/
  `get_language_iso` + `safe_filename` + `cached_json_fetch`.
- New `_omw_en.py`: cached `omw_en_pos()` returning `{lemma: frozenset
  of POSes}`.
- New `_wiktionary.py`: REST + Action API fetchers with per-thread
  `requests.Session`, batched/paginated category fetches, HTML entity
  decoding, archaic-or-sound filter scoped to English.
- `_pos_map.py`: derived `CONTENT_POS_MAP` (label → WN POS code) for
  the gap-escape check.
- `create_extensions.py`: `xml.sax.saxutils.escape` replaces hand-rolled
  XML escaping; `NormalizedSense` namedtuple replaces dict-shape
  smuggling for Wiktionary fallback senses; in-script caches under
  `extras/{wikidata,wiktionary,wiktionary-cats}/`.

EN extension: ~7,900 entries, 130 language files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@AmitMY AmitMY closed this May 22, 2026