update from Gluejar#94
Open
eshellman wants to merge 1111 commits into
Open
Conversation
more harvest
2023 final
update handling of DOAB coers
fix all the null doab covers!
and don' call distinct unless needed
update tests, fix slow OPDS, optimize queryset access
Closes part of #1131. The SEND_TEST_EMAIL_JOB is a dev-era artifact that's been hardcoded to mail `unglueit@ebookfoundation.org` 72 times/day whenever a deploy is tagged `deploy_type=test`. On production it's never registered (`'prod' == 'test'` is False), but on test.unglue.it it fired 1,796 times through 2026-02-24, polluting Eric's inbox with "hi there 20,25,30" / "testing 1, 2, 3" messages. See #1131 for the full analysis. The companion change removing the `if '{{ deploy_type }}' == 'test':` branch from `prod.py.j2` is in EbookFoundation/regluit-provisioning. Also removes the commented-out reference in settings/dev.py so the symbol name doesn't dangle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Catch ValueError/TypeError/KeyError from int(self.data["notarobot"]) and raise forms.ValidationError instead, so malformed submissions render as a normal form error rather than a 500. Fixes #1136 Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
The legacy `auth_logout` name was provided by old django-registration; django_registration 3.4 no longer defines it. `django.contrib.auth.urls` (already included in libraryauth/urls.py) provides the equivalent name `logout`. Without this, any authenticated request that renders base.html crashes with NoReverseMatch. Surfaced on dj42.unglue.it after a successful Google sign-in. Note: Django 4.2's LogoutView still accepts GET with a deprecation warning; the GET-via-link → POST-only migration is a separate follow-up needed before Django 5.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Django 4.2's LogoutView, with no `next` param and no LOGOUT_REDIRECT_URL configured, falls through to rendering `registration/logged_out.html`. This template does not exist in the repo (the legacy template was `registration/logout.html`, served by old django-registration's auth_logout view). The navbar Sign Out link has no `?next=`, so without this setting clicking Sign Out would produce TemplateDoesNotExist. Caught by Codex on PR #1145. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix auth_logout URL name in templates (Django 4.2)
The 429 handler in load_doab_oai logs to the Django logger (unglue.it.log) but otherwise returns normally — so the cron's doab-harvest.log records "loaded 0 records, ending at 2000-01-01" or a truncated success line, indistinguishable from a clean run. Add a 4th return value (`error`) from load_doab_oai, set when 429 is caught. The management command echoes "ERROR: DOAB OAI rate-limited..." to stdout when present, so 429s appear in doab-harvest.log alongside the regular summary line. Bitstream-API 429s in doab_utils.get_streamdata are out of scope for this patch — they fire per-record rather than terminating the harvest, and surfacing them well needs aggregation. Filed as a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When DOAB's OAI endpoint returns HTTP 429, the previous behavior was to log and continue. Cron's next nightly run would re-hit DOAB inside the ban window, which refreshes the 24h ban. The harvest never escaped on its own. This commit: - Persists the Retry-After deadline to a sentinel file across cron invocations. - Skips subsequent runs while now < deadline (and clears the file once the deadline passes). - Honors RFC 9110 Retry-After (delta-seconds or HTTP-date) via the parser in core.loaders.doab. - Falls back to a conservative 1-hour deadline if Retry-After is missing or unparseable (handled in core.loaders.doab). - Uses timezone-aware UTC throughout; persisted ISO strings include the offset. - Logs sentinel write failures to stderr with a traceback; silent swallowing here would mask the very ban-extension loop this patch prevents. - Adds --ignore-retry-after operator override. Sentinel path defaults to /var/log/regluit/doab-oai-retry-after.state (durable, in a dir already writable by the cron user, not touched by the existing log-cleanup cron). Override via DOAB_OAI_SENTINEL_PATH env var for dev/test environments. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surface DOAB OAI 429 errors in load_doab cron output
…after Respect DOAB OAI Retry-After across cron runs
The DOAB bitstream/REST endpoints (used by get_streamdata and the cover download in store_doab_cover) get rate-limited independently of the OAI endpoint. The OAI fix (#1143 + #1144) is a cross-cron sentinel; the bitstream API is called per-record inside the harvest loop, so it needs an in-process circuit breaker that also persists across runs. This commit: - Adds an in-process circuit breaker in core.loaders.doab_utils that opens when either bitstream endpoint returns 429. While open, calls return None without making HTTP requests. - Persists the deadline to a sentinel file (/var/log/regluit/doab-bitstream-retry-after.state by default; DOAB_BITSTREAM_SENTINEL_PATH env override) so subsequent cron runs honor the same window without an extra hit. - Reuses parse_retry_after() (RFC 9110 delta-seconds OR HTTP-date) by moving it from doab.py into doab_utils.py. - Single ERROR log per trip — no more per-record 429 spam in the Django log (was ~14/run on May 8). - Patches both endpoints (get_streamdata + the cover requests.get call inside store_doab_cover, including the post-redirect retry). - Test-only reset hook for unit tests. Doesn't stall OAI harvesting — only cover work is skipped while the bitstream breaker is open. Each cover lookup returns None (= no cover), and the harvest continues processing the record. doab-check has the same get_streamdata code but no callers (dead code inherited from regluit) — patch confined to this repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…breaker Add DOAB bitstream API circuit breaker (HTTP 429)
Switches the pyoai pin from infrae/pyoai (last release March 2022, explicitly unmaintained per its own README) to our fork carrying the RateLimitedError patch: EbookFoundation/pyoai @ bf709d26ae6e4b34b9b0ca726e0f032d97f0bd38 Branch: fix/expose-429-as-rate-limited-error The fork raises a structured RateLimitedError on HTTP 429 with Retry-After parsed per RFC 9110, instead of leaving callers to parse HTTPError.headers themselves. An upstream PR to eth-library/oaipmh (the maintained fork; infrae/pyoai is no longer accepting changes) is pending; we'll un-pin to that fork when it lands. load_doab_oai now catches RateLimitedError first; the existing HTTPError-with-code-429 path is kept as a runtime fallback for deployments that revert the pin or run against stock pyoai. A never-raised placeholder class handles the ImportError case so the except clause is always syntactically valid. Step C of today's DOAB rate-limit plan (after #1143/#1144 OAI sentinel and #1147 bitstream breaker). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pin pyoai to EbookFoundation fork; consume RateLimitedError
Resolve settings/dev.py conflict: keep Celery 5.x setting renames from this branch (CELERY_BEAT_SCHEDULE, CELERY_WORKER_HIJACK_ROOT_LOGGER) plus master's ADMINS email update; both sides drop the send_test_email schedule line. Brings in 16 master commits since fork (DOAB 429 handling, pyoai EbookFoundation fork pin, notarobot int-guard, SEND_TEST_EMAIL_JOB removal, admin-email rate limit). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Compute the publication year range from a single evaluated edition queryset instead of two separate (asc + desc) queries. The previous code initialized `latest_publication` to None and set it only from the second query; when the dated-edition set changed between the two reads (e.g. concurrent edition loading) the second query could find no truthy dates, leaving `latest_publication` None and crashing on `earliest_publication + "-" + latest_publication`. Also switch to save(update_fields=['publication_range']) so this nominally-read property doesn't persist other (possibly stale) in-memory Work fields. Adds WorkTests regression coverage: mixed null/blank/valid dates, single-year vs range, all-blank, and a query-count guard proving the single-query shape. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The api `widget` view treated any non-"featured", non-10/13-char token
as a work id and called safe_get_work() without catching
Work.DoesNotExist, so /api/widget/<bad-id>/ returned 500 instead of the
existing empty-widget response. Two related defects fixed at the same
time:
- convert_10_to_13() returns None for an invalid ISBN-10, after which
`len(isbn)` raised TypeError. Guard with `if isbn and len(isbn)==13`.
- widget.html renders "...ISBN {{isbn}}..." but the work=None paths did
not pass `isbn`. Pass it into every render path.
Wrap safe_get_work() in try/except Work.DoesNotExist -> render empty
widget, matching the existing Identifier.DoesNotExist handling.
Adds api ApiTests regression coverage: non-numeric token, numeric
unknown id, and invalid ISBN-10 all return 200.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Keeps the Python 'if date' guard as belt-and-suspenders: the structural invariant (years contains only truthy strings) stays enforced locally, independent of the queryset. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…into dj42.unglue.it
…to dj42.unglue.it
Fix #1155: TypeError in Work.publication_date from two-query race
Fix #1156: widget endpoint 500s on unknown/non-numeric/invalid ids
Staging boxes restored from a prod snapshot keep prod's Site row (domain='unglue.it'), so every emailed link (password-reset, notices, etc.) points at prod instead of the staging box's own host. This command updates Site.objects.get_current() (the SITE_ID row) to the supplied domain (and optional name). It is idempotent: if the row already matches, it prints a no-op message and exits cleanly. Used by the provisioning repo's post-deploy Ansible task to localise the Site to the box's own server_name on every deploy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mechanical, no-meaning-change corrections to the FAQ page surfaced by the CC+Codex copy review on #1165. Typos, a broken URL, proper-noun/acronym fixes, subject-verb agreement, and site-name casing — nothing factual. - "that why" → "that's why"; "an non-profit" → "a non-profit" - "do well be selling" → "by selling"; "the the copyright" → "the copyright" - "right holder tools" → "rights holder tools"; "some interested" → "some interest" - broken Facebook URL "facebook/com" → "facebook.com" - "Wikisources/Hathi Trust/Github" → "Wikisource/HathiTrust/GitHub" - "page.You'll" → "page. You'll"; mid-sentence "Let" → "let" - "cannot not be obtained" → "cannot be obtained"; "They does" → "They do" - "Authors' Guild" → "Authors Guild"; CC license styling "NoDerivatives, NonCommercial" - site-name casing unglue.it → Unglue.it; "Thanks for Ungluing" → "Thanks-for-Ungluing" Factual/staleness/voice items (fees, payouts, sender email, campaign retirement, etc.) are handled separately in the judgment-call PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
FAQ copy: objective typo & grammar fixes (#1165)
Mechanical, no-meaning-change corrections to the FAQ page surfaced by the CC+Codex copy review on #1165. Typos, a broken URL, proper-noun/acronym fixes, subject-verb agreement, and site-name casing — nothing factual. - "that why" → "that's why"; "an non-profit" → "a non-profit" - "do well be selling" → "by selling"; "the the copyright" → "the copyright" - "right holder tools" → "rights holder tools"; "some interested" → "some interest" - broken Facebook URL "facebook/com" → "facebook.com" - "Wikisources/Hathi Trust/Github" → "Wikisource/HathiTrust/GitHub" - "page.You'll" → "page. You'll"; mid-sentence "Let" → "let" - "cannot not be obtained" → "cannot be obtained"; "They does" → "They do" - "Authors' Guild" → "Authors Guild"; CC license styling "NoDerivatives, NonCommercial" - site-name casing unglue.it → Unglue.it; "Thanks for Ungluing" → "Thanks-for-Ungluing" Factual/staleness/voice items (fees, payouts, sender email, campaign retirement, etc.) are handled separately in the judgment-call PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 0c43233)
… not a hard fail Codex review of #1164 fix: a fresh/scrubbed DB without a Site row for SITE_ID would crash the post-deploy task with Site.DoesNotExist. get_or_create makes it self-healing while staying idempotent on existing rows. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add set_site_domain management command (fix #1164)
Staging boxes restored from a prod snapshot keep prod's Site row (domain='unglue.it'), so every emailed link (password-reset, notices, etc.) points at prod instead of the staging box's own host. This command updates Site.objects.get_current() (the SITE_ID row) to the supplied domain (and optional name). It is idempotent: if the row already matches, it prints a no-op message and exits cleanly. Used by the provisioning repo's post-deploy Ansible task to localise the Site to the box's own server_name on every deploy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… not a hard fail Codex review of #1164 fix: a fresh/scrubbed DB without a Site row for SITE_ID would crash the post-deploy task with Site.DoesNotExist. get_or_create makes it self-healing while staying idempotent on existing rows. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.