feat(wikidata-lexemes): forms, content-POS gap escape, quality filters #326
Closed
AmitMY wants to merge 29 commits into
Add form_lexicon_index to improve forms query performance by enabling direct join from forms to lexicons. Update query to use SQL DISTINCT and indexed columns for faster results. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Add form_lexicon_form_covering_index on (lexicon_rowid, form) which eliminates temporary B-tree creation for DISTINCT operations and allows the query to be satisfied entirely from the index. Reduces forms query time by ~55% (3s → ~1.3s estimated in production). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
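A minimal sketch of the covering-index idea, using Python's built-in `sqlite3` module. The two-column `forms` table is a simplified stand-in for the real schema (an assumption, not the actual table definition), and the sample rows are invented. It asks SQLite for the query plan so the index use can be inspected directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-in table: only the two columns the index covers.
conn.execute(
    "CREATE TABLE forms (lexicon_rowid INTEGER NOT NULL, form TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO forms VALUES (?, ?)",
    [(1, "run"), (1, "runs"), (1, "run"), (2, "laufen")],
)

# Index name and column order as given in the commit message.
conn.execute(
    "CREATE INDEX form_lexicon_form_covering_index "
    "ON forms (lexicon_rowid, form)"
)

query = "SELECT DISTINCT form FROM forms WHERE lexicon_rowid = ?"

# Each EXPLAIN QUERY PLAN row is (id, parent, notused, detail).
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (1,))]
forms = [row[0] for row in conn.execute(query, (1,))]

print(plan)
print(forms)
```

Because both the filter column and the selected column live in the index, SQLite can answer the query without reading the table rows themselves.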
…ker config
- Add with_entities query param to forms endpoint (default true), filtering out forms with uppercase letters at SQL level when false
- Use cached_response decorator and lru_cache for forms endpoint
- Optimize Dockerfile: copy only needed files, use PORT env variable
- Add German WordNet (odenet:1.4) to Docker image
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Merged 47 commits from upstream including:
- Python 3.10+ requirement
- New wn.ili module
- Sense.synset_relations() method
- Code formatting and linting improvements
Kept our web module (wn/web.py, tests/web_test.py, pyproject.toml web extra) that upstream removed, and merged schema hashes for compatibility.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The merged upstream schema has changes (modified column in lexicons, new tables for ILI) that resulted in a different schema hash. Added the new hash to COMPATIBLE_SCHEMA_HASHES. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds multilingual WordNet extensions generated from Wikidata Lexemes, covering function words (prepositions, conjunctions, determiners, pronouns, etc.) across ~130 languages. Updates Dockerfile to load extension XML files into the database at build time and fixes CMD to use JSON array format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Include SemCor corpus frequency counts in the included synsets of the words endpoint, enabling clients to rank word senses by usage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-warm DB connection and forms query cache during startup so first user request doesn't pay the cold-query cost. Add debug logging to forms endpoint to identify server-side bottlenecks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The root endpoint now returns a JSON listing of available endpoints, so the test should expect 200 instead of 404. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Sort imports (I001) in web.py, web_test.py
- Replace Optional/Union with modern X | None syntax (UP045, UP007)
- Replace dict() calls with dict literals (C408)
- Fix line length violations (E501) in web.py, _db.py, create_extensions.py
- Combine nested if statements (SIM102) in create_extensions.py
- Remove unnecessary "r" mode argument (UP015) in create_extensions.py
- Extract helpers to reduce function complexity (C901) in create_extensions.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change publish workflow trigger from push to workflow_dispatch
- Add type: ignore comments to starlette middleware imports
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Level 4 achieves nearly the same compression ratio as level 9 (29% vs 28%) at 10x the speed (23ms vs 268ms for 152K forms). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
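A rough way to reproduce this kind of comparison, assuming the compression in question is gzip (the message does not name the codec). The payload is synthetic, so the ratios and timings it prints will differ from the figures quoted above.

```python
import gzip
import json
import time

# Synthetic payload standing in for a large forms response.
payload = json.dumps([f"form-{i}" for i in range(50_000)]).encode("utf-8")


def measure(level: int) -> tuple[float, float]:
    """Return (compressed size / original size, elapsed milliseconds)."""
    start = time.perf_counter()
    compressed = gzip.compress(payload, compresslevel=level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Round-trip check so a broken measurement cannot go unnoticed.
    assert gzip.decompress(compressed) == payload
    return len(compressed) / len(payload), elapsed_ms


results = {level: measure(level) for level in (4, 9)}
for level, (ratio, ms) in results.items():
    print(f"level {level}: ratio {ratio:.1%}, {ms:.1f} ms")
```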
Bakes query planner statistics into the database image so SQLite can make better index choices on cold start. Also adds the post-ANALYZE schema hash to the compatibility list. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* upstream/main: (26 commits)
  Bump version to v1.1.0
  Support the WN_DATA_DIR environment variable
  Mypy fix
  Reset lexconf on translated synsets
  Fix mypy and ruff fmt issues
  Add method and command to view cache entries
  Improve type modeling of wn._config
  Use a ResourceType enum instead of constant strs
  Make schema hash resilient to SQLite maintenance operations
  Fix command syntax in CLI documentation
  Cut release in CHANGELOG.md; update email address
  Bump to v1.0.0
  Add OMW 2.0 to the index
  Bump to version v1.0.0rc0
  Remove Conda instructions; add warning about db
  Add wn.reset_database() function
  Add OEWN 2025 and 2025+ to the index
  Add support for the OEWN-2025 escaping scheme
  Some typing and code quality fixes
  Support exporting lexicon extensions
  ...
# Conflicts:
# wn/_db.py
…lifespan
- Enrich word form responses with structured objects including pronunciation data
- Migrate deprecated on_startup to lifespan context manager for Starlette compat
- Add new schema hash after upstream merge (Pronunciation/Tag as lexicon elements)
- Update en.xml wikidata lexemes extension data
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…function-word POS Previously the Dockerfile loaded each `extensions/wikidata-lexemes/output/*.xml` as a separate lexicon, so queries against e.g. `omw-en:1.4` couldn't see the new function-word entries. This change introduces a merge utility that calls `wn.add` and then rewrites `lexicon_rowid` across all content tables so the extension's data lives under the base lexicon, leaving one lexicon per language (see goodmami#304). Also adds six new POS codes (h pronoun, d determiner, m numeral, i interjection, q interrogative, y particle) to `wn.constants.PARTS_OF_SPEECH` and updates the Wikidata extension generator to emit grouped short codes instead of verbose labels like "personal pronoun" or "definite article". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
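The `lexicon_rowid` rewrite can be pictured with a toy SQLite database. Everything here is hypothetical: the three-table schema, the `CONTENT_TABLES` tuple, the `reassign_lexicon` helper, and the `wikidata-en` lexicon id are illustrations, not the fork's merge utility. Deleting the extension's lexicon row afterwards is also an assumption about how "one lexicon per language" is reached.

```python
import sqlite3

# Hypothetical, pared-down list; the real database has more content tables.
CONTENT_TABLES = ("entries", "synsets")


def reassign_lexicon(conn: sqlite3.Connection, extension_rowid: int, base_rowid: int) -> None:
    """Move every row owned by the extension lexicon under the base lexicon."""
    with conn:  # single transaction: commit on success, roll back on error
        for table in CONTENT_TABLES:
            conn.execute(
                f"UPDATE {table} SET lexicon_rowid = ? WHERE lexicon_rowid = ?",
                (base_rowid, extension_rowid),
            )
        # Assumption: the now-empty extension lexicon row is dropped.
        conn.execute("DELETE FROM lexicons WHERE rowid = ?", (extension_rowid,))


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lexicons (rowid INTEGER PRIMARY KEY, id TEXT);
    CREATE TABLE entries (id TEXT, lexicon_rowid INTEGER);
    CREATE TABLE synsets (id TEXT, lexicon_rowid INTEGER);
    INSERT INTO lexicons VALUES (1, 'omw-en'), (2, 'wikidata-en');
    INSERT INTO entries VALUES ('omw-en-dog-n', 1), ('wikidata-en-the-d', 2);
    INSERT INTO synsets VALUES ('wikidata-en-the-s1', 2);
    """
)

reassign_lexicon(conn, extension_rowid=2, base_rowid=1)

entry_owners = conn.execute("SELECT DISTINCT lexicon_rowid FROM entries").fetchall()
synset_owners = conn.execute("SELECT DISTINCT lexicon_rowid FROM synsets").fetchall()
lexicon_count = conn.execute("SELECT COUNT(*) FROM lexicons").fetchone()[0]
print(entry_owners, synset_owners, lexicon_count)
```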
feat: merge Wikidata extensions into base lexicons + add function-word POS
First sign-specific fork release after upstream v1.1.0. Includes the Wikidata-extension merge utility, six new function-word POS codes, and the pronunciation/web fixes that landed on this fork. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- wn/web.py: move `from contextlib import asynccontextmanager` into the stdlib import group (was a one-off import section by itself, tripping ruff I001).
- wn/_db.py: shorten two schema-hash comments that overflowed the 88-col limit after recent additions.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous regeneration of extension XMLs left ~150 verbose Wikidata
POS labels (`abbreviation`, `abstract noun`, `auxiliary verb`, ...) in
place. That contradicts `create_extensions.py`'s contract of
`POS_MAP.get(pos_name, OTHER)` and caused 8,193 entries/synsets across
66 languages to be stored with POS values outside
`wn.constants.PARTS_OF_SPEECH`.
After this change every `partOfSpeech` attribute is one of
`{h, d, n, v, a, s, r, m, p, c, q, i, y, t, x, u}`. Anything Wikidata
didn't classify into our function-word groupings collapses to `x`,
matching what a fresh `create_extensions.py` run would produce.
Reported by Cursor Bugbot on PR #1.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
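The invariant stated in this message can be checked mechanically. The sketch below is an illustrative check, not part of the repository: `invalid_pos_values` is a hypothetical helper and the XML snippet is made up. It collects every `partOfSpeech` attribute in a document and reports the values outside the sixteen codes listed above.

```python
import xml.etree.ElementTree as ET

# The sixteen single-letter codes listed in the commit message.
ALLOWED_POS = set("hdnvasrmpcqiytxu")


def invalid_pos_values(xml_text: str) -> set[str]:
    """Return the partOfSpeech values that fall outside ALLOWED_POS."""
    root = ET.fromstring(xml_text)
    found = {
        el.get("partOfSpeech")
        for el in root.iter()
        if el.get("partOfSpeech") is not None
    }
    return found - ALLOWED_POS


# Made-up sample with one verbose label left in place.
sample = """<LexicalResource>
  <Lexicon id="demo">
    <LexicalEntry id="e1"><Lemma writtenForm="the" partOfSpeech="d"/></LexicalEntry>
    <LexicalEntry id="e2"><Lemma writtenForm="ought" partOfSpeech="auxiliary verb"/></LexicalEntry>
    <Synset id="s1" partOfSpeech="x"/>
  </Lexicon>
</LexicalResource>"""

print(invalid_pos_values(sample))
```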
The previous normalization collapsed everything outside the function-word
POS_MAP to OTHER ("x"), but many Wikidata labels are clearly variants of
standard WordNet codes:
- noun variants → "n": common/abstract/compound/count/mass/personal/
locative/indeclinable/verbal noun, noun class, nominal, plurale tantum,
jukugo, names, location
- verb variants → "v": auxiliary/passive/nominalized verb, copula,
participle, gerund, transgressive, converb, preverb, verbaliser,
past tense, imperative, infinitive
- adjective variants → "a": prenominal/adnominal/demonstrative/
possessive/relative/reflexive/indefinite/numeral/pronominal/
interrogative adjective, na-adjective, true adjective, predicative,
nominal modifier
- adverb variants → "r": postpositive/demonstrative/modal/correlative/
pronominal/predicative adverb, intensifier, verbal modifier
- conjunction variants → "c": coordinating/subordinating/concessive
conjunction, complementizer, linker
- adposition variants → "p": postposition, ambiposition, localiser
- numeral variants → "m": number, digit, ordinal numeral, classifier
(Chinese/Japanese counter words and measure words)
- pronoun variants → "h": subject/object/possessive/resumptive/
distributive/determinative pronoun, pro-form
- determiner variants → "d": definite/indefinite determiner,
interrogative determinant, article, quantifier, partitive
- particle variants → "y": modal/negative/emphatic/vocative/allocutive/
interrogative/verbal/japanese/honorific/exceptive particle, clitics,
object/proper-noun markers, 助字
- phrase variants → "t": phraseme, established collocation, fixed
expression, locution, infinitive phrase, verbal expression, etc.
- interjection variants → "i": greeting, idiomatic interjection
- interrogative variants → "q": question, question tag
The "x" bucket now only contains genuinely non-POS labels: roots,
plurals, sub-word affixes, onomatopoeia, sentences, etc.
Brings 5,195 more entries into typed POS codes (8,193 → 2,998 still "x"),
across all 66 affected language XMLs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
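A small excerpt of what such a mapping could look like, with the `POS_MAP.get(pos_name, OTHER)` lookup mentioned earlier. The label-to-code pairs are copied from the list above. The dict is deliberately incomplete, and `normalize_pos` is a hypothetical wrapper name rather than a function confirmed to exist in `create_extensions.py`.

```python
OTHER = "x"

# Excerpt only: one or two labels per group from the list above.
POS_MAP = {
    "abstract noun": "n",
    "auxiliary verb": "v",
    "copula": "v",
    "na-adjective": "a",
    "intensifier": "r",
    "subordinating conjunction": "c",
    "postposition": "p",
    "ordinal numeral": "m",
    "pro-form": "h",
    "article": "d",
    "negative particle": "y",
    "phraseme": "t",
    "greeting": "i",
    "question tag": "q",
}


def normalize_pos(label: str) -> str:
    """Map a Wikidata POS label to a short code, falling back to OTHER."""
    return POS_MAP.get(label, OTHER)


for label in ("copula", "article", "root"):
    print(label, "->", normalize_pos(label))
```

Labels that are not in the mapping, such as `root` here, still fall through to `x`.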
Patch release covering the post-v1.2.0 fixes:
- Lint fixes on main
- Normalize unmapped POS values to OTHER (Cursor Bugbot finding)
- Expand POS_MAP with noun/verb/adjective/etc. variants
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…gloss fallback
- create_extensions.py now writes <Example> tags under <Synset> in addition to <Sense>. The wn web API only surfaces examples via synset endpoints (and the synset entries inside /words `included`), so prior output silently hid every Wikidata usage example.
- create_extensions.py now falls back to other-language glosses when the lemma language is missing one (prefer lemma lang → English → any). Eliminates empty <Definition> tags across all 130 XMLs.
- Regenerated all extension output XMLs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
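The fallback order (lemma language, then English, then any) might be sketched like this. `pick_gloss` and the shape of its `glosses` argument (language code mapped to gloss text) are assumptions for illustration; the actual generator may structure its data differently.

```python
from __future__ import annotations


def pick_gloss(glosses: dict[str, str], lemma_lang: str) -> str | None:
    """Pick a gloss: lemma language first, then English, then any language."""
    # Preferred languages, in priority order.
    for lang in (lemma_lang, "en"):
        text = glosses.get(lang, "").strip()
        if text:
            return text
    # Last resort: the first non-empty gloss in any language.
    for text in glosses.values():
        if text.strip():
            return text.strip()
    return None


print(pick_gloss({"de": "Hund", "en": "dog"}, "de"))
print(pick_gloss({"en": "dog", "fr": "chien"}, "de"))
print(pick_gloss({"fr": "chien"}, "de"))
```

Returning `None` only when no language has a non-empty gloss is what would let a caller avoid writing an empty `<Definition>` tag.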
Hosts like OrbStack inject IPv6 CIDR entries into NO_PROXY that httpx 0.28 fails to parse, breaking the in-image `wn download` step. Document the `--build-arg NO_PROXY=` workaround. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Resolves ruff UP038. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Minor release covering:
- Wikidata extension examples now visible via the wn web API (mirrored from <Sense> onto <Synset>).
- Multilingual gloss fallback eliminates empty <Definition> tags in the extension output.
- README note for the OrbStack docker build workaround.
- Ruff UP038 cleanup in wn/_config.py.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add alternative forms, English content-POS gap coverage, and aggressive
quality filtering to the Wikidata Lexemes extension. Top-1000 EN coverage
goes from 84% (omw-en alone) to 99.2% with the extension.
Major changes:
- Emit `<Form>` elements from each lexeme's `forms[]`, so lookups for
inflected forms like `mine`/`should`/`could` resolve to their base
lemma's entry. Filter out forms with negation grammaticalFeature
`Q1478451` (`shouldn't`, `shan't`) and apostrophe-leading contractions
(`'ll`, `'d`).
- English content-POS gap escape: a SKIP_POS-classified lemma whose
(lemma, POS) isn't in omw-en under any morphological lookup is now
included. Pulls in ~7k missing function/content words.
- Modal-verb override (`P31 = Q560570`): always keep `can`/`will`/`shall`
so their forms `could`/`would`/`should` surface. Bypasses the archaic-
Wiktionary filter for these.
- Wiktionary REST fallback for 0-sense lexemes, with quality filters:
reference-only definitions, onomatopoeia / dialectal / archaic
category filters, capitalised-lemma filter, empty-claims filter.
- Filter dedup + multi-word-capital + empty-claims moved into
`filter_lexemes` so `kept_lang_senses` matches what we actually write
— otherwise sense-relation targets dangle when the merge step loads
the XML.
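The form filter from the first bullet could look roughly like the following. It assumes the standard Wikidata lexeme JSON shape, where each form has `representations` and `grammaticalFeatures`. The `keep_form` name, the handling of the typographic apostrophe, and the choice to drop forms with no representation in the target language are assumptions of this sketch.

```python
NEGATION_FEATURE = "Q1478451"  # negation grammaticalFeature named above


def keep_form(form: dict, lang: str = "en") -> bool:
    """Return True if a lexeme form should be emitted as a <Form>."""
    if NEGATION_FEATURE in form.get("grammaticalFeatures", []):
        return False
    rep = form.get("representations", {}).get(lang)
    if rep is None:
        # Assumption: no representation in the target language means skip.
        return False
    # Apostrophe-leading contractions; the curly apostrophe is an
    # extra case this sketch chooses to cover.
    return not rep["value"].startswith(("'", "\u2019"))


# Hand-written sample in the Wikidata forms[] shape (feature lists trimmed).
shall_forms = [
    {"representations": {"en": {"language": "en", "value": "shall"}}, "grammaticalFeatures": []},
    {"representations": {"en": {"language": "en", "value": "should"}}, "grammaticalFeatures": []},
    {"representations": {"en": {"language": "en", "value": "shan't"}}, "grammaticalFeatures": ["Q1478451"]},
    {"representations": {"en": {"language": "en", "value": "'ll"}}, "grammaticalFeatures": []},
]

kept = [f["representations"]["en"]["value"] for f in shall_forms if keep_form(f)]
print(kept)
```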
Module refactor:
- New `_wikidata.py`: cached `fetch_wikidata_entity`/`get_label`/
`get_language_iso` + `safe_filename` + `cached_json_fetch`.
- New `_omw_en.py`: cached `omw_en_pos()` returning `{lemma: frozenset
of POSes}`.
- New `_wiktionary.py`: REST + Action API fetchers with per-thread
`requests.Session`, batched/paginated category fetches, HTML entity
decoding, archaic-or-sound filter scoped to English.
- `_pos_map.py`: derived `CONTENT_POS_MAP` (label → WN POS code) for
the gap-escape check.
- `create_extensions.py`: `xml.sax.saxutils.escape` replaces hand-rolled
XML escaping; `NormalizedSense` namedtuple replaces dict-shape
smuggling for Wiktionary fallback senses; in-script caches under
`extras/{wikidata,wiktionary,wiktionary-cats}/`.
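To make the gap-escape check concrete, here is a simplified sketch built on the `{lemma: frozenset of POSes}` shape that `omw_en_pos()` is described as returning. The `is_gap` helper, the four-entry `CONTENT_POS_MAP` excerpt, and the toy lexicon are all hypothetical. The sketch also compares exact lemmas only, whereas the message says the real check also tries morphological lookups.

```python
from __future__ import annotations

# Hypothetical excerpt of the Wikidata-label -> WN POS code mapping.
CONTENT_POS_MAP = {"noun": "n", "verb": "v", "adjective": "a", "adverb": "r"}


def is_gap(lemma: str, pos_label: str, omw_pos: dict[str, frozenset[str]]) -> bool:
    """True if (lemma, POS) is a content-POS pair missing from the base lexicon."""
    code = CONTENT_POS_MAP.get(pos_label)
    if code is None:
        # Not a content POS: the gap escape does not apply.
        return False
    return code not in omw_pos.get(lemma, frozenset())


# Toy stand-in for the cached omw-en lookup table.
omw_pos = {
    "dog": frozenset({"n", "v"}),
    "quick": frozenset({"a", "r"}),
}

print(is_gap("dog", "noun", omw_pos))
print(is_gap("dog", "adjective", omw_pos))
print(is_gap("selfie", "noun", omw_pos))
```

Using frozensets keeps the per-lemma membership test cheap and makes the cached table safe to share between callers.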
EN extension: ~7,900 entries, 130 language files.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
- Emit `<Form>` elements from each Wikidata lexeme's `forms[]`, so lookups for inflected forms like `mine`/`should`/`could` resolve to the base lemma's entry. Filter out forms with negation grammaticalFeature `Q1478451` (`shouldn't`, `shan't`) and apostrophe-leading contractions (`'ll`, `'d`).
- English content-POS gap escape: a `SKIP_POS`-classified lemma whose `(lemma, POS)` isn't already in omw-en (under any morphological lookup) is now kept. Pulls in ~7k missing function/content words.
- Modal-verb override (`P31 = Q560570`): always keep `can`/`will`/`shall` so their forms `could`/`would`/`should` surface; this bypasses the archaic-Wiktionary filter for these.
- Filter dedup + multi-word-capital + empty-claims moved into `filter_lexemes` so `kept_lang_senses` matches what's actually written; without this, sense-relation targets dangle and the merge step rejects the XML.
Module refactor
- `_wikidata.py`: cached `fetch_wikidata_entity`/`get_label`/`get_language_iso` + `safe_filename` + `cached_json_fetch`.
- `_omw_en.py`: cached `omw_en_pos()` returning `{lemma: frozenset of POSes}`.
- `_wiktionary.py`: REST + Action API fetchers with per-thread `requests.Session`, batched/paginated category fetches, HTML entity decoding, archaic-or-sound filter scoped to English.
- `_pos_map.py`: derived `CONTENT_POS_MAP` (Wikidata POS label → WN POS code) for the gap-escape check.
- `create_extensions.py`: `xml.sax.saxutils.escape` replaces hand-rolled XML escaping; `NormalizedSense` namedtuple replaces dict-shape smuggling; caches now under `extras/{wikidata,wiktionary,wiktionary-cats}/`.
Coverage
Verified end-to-end via the Docker image. EN extension: ~7,900 entries, 130 language files. Top-1000 most-common English words covered: omw-en alone 94.1% → with this extension 99.2%. Remaining 8 (`microsoft`, `sony`, `ebay`, `et`, `rss`, `st`, `jul`, `jun`) are brand names / abbreviations / Latin loanwords, which are defensible exclusions.
Test plan
- `docker build --network=host --build-arg NO_PROXY= --build-arg no_proxy= --build-arg HTTP_PROXY= --build-arg HTTPS_PROXY= -t wn-test .` succeeds (merge step loads all 130 XMLs without dangling-relation errors).
- `docker run --rm wn-test python -c "import wn, wn.morphy; en = wn.Wordnet(lexicon='omw-en', lemmatizer=wn.morphy.Morphy()); print([(w.lemma(), w.pos, w.synsets()[0].definition()) for w in en.words('mine')])"` returns the `my` pronoun entry with definition `(first-person singular possessive) belonging to me`.
- `should` returns `shall (v, "Used before a verb to indicate the simple future tense …")`.
- `python create_extensions.py` regenerates from the dump deterministically (no network calls if the Wiktionary cache is warm).
🤖 Generated with Claude Code