Skip to content

expire: skip per-article SMretrieve via cancel tombstone log#340

Open
kev009 wants to merge 2 commits into
InterNetNews:mainfrom
kev009:expire-tombstone
Open

expire: skip per-article SMretrieve via cancel tombstone log#340
kev009 wants to merge 2 commits into
InterNetNews:mainfrom
kev009:expire-tombstone

Conversation

@kev009
Copy link
Copy Markdown
Contributor

@kev009 kev009 commented May 9, 2026

Adds an opt-in (expiretombstone in inn.conf, default false) side-channel that records every cancellation alongside the existing storage and overview deletions, so a subsequent expire run can drop matching history entries without doing per-article SMretrieve(RETR_STAT) syscalls. At million-article scale on tradspool/timehash this turns a million access(2) calls into a few thousand hash lookups.

Two log files cooperate:

${pathdb}/expireover.tombstone: written by expireover after each
successful inline SMcancel; in delayrm mode written up front and
promoted by expirerm after fastrm succeeds. Atomic .NEW -> final
rename under an exclusive non-blocking fcntl POSIX lock. A
crashed-or-failed previous run's leftover .NEW is verified
per-token via SMretrieve and merged into the next run, so partial
rmfile failures cannot orphan articles on disk.

${pathdb}/cancels.tombstone: appended by innd's ARTcancel and by
sm -r through a new public SMcanceltombstone() helper. Shared
fcntl POSIX lock for appenders, exclusive for expire's snapshot
rename to .processing; one retry on EAGAIN closes the appender
vs consumer race window. POSIX O_APPEND atomicity for sub-
PIPE_BUF writes keeps single-line tokens from interleaving.

expire reads both files into one hashset, treats every article in either log as already gone, and falls back to per-article SMretrieve only for storage methods where SMprobe(SELFEXPIRE) is true (CNFS wrap-around). An empty-but-present tombstone correctly says "no cancels this cycle" and skips SMretrieve for everything; truly absent files fall back to the slow path.

The format is line-per-token; readers ignore blank and #-prefixed lines, and expireover writes "# inn-tombstone v1" as a forward- compatibility header. Disk and memory footprint scale linearly: ~38 bytes per entry on disk, ~50 bytes in expire's hash table.

Tests cover the public library (tombstone_hash_create, _read, _rename_for_processing, _present), SMcanceltombstone branches including expiretombstone=false bypass and unwritable-pathdb non-crash, two-file merge with dedup, comment-line skip, HISexpire integration with a callback that mirrors EXPdoline's full decision tree (kept / tombstoned / selfexpire-gone / selfexpire-alive), and an end-to-end shell test that exercises sm + cancels.tombstone + expiretombstone toggling.

Default is false; sites should opt in after validating the option's behavior against their workload. Revisit default after wider testing.

@Julien-Elie
Copy link
Copy Markdown
Contributor

A great idea! Thanks for it, Kevin.
I need a bit of time to catch up on all your recent improvements. Will be done before the end of the month.

@kev009
Copy link
Copy Markdown
Contributor Author

kev009 commented May 9, 2026

This one is definitely more invasive than the expireover change (which is basically a free lunch), and it is less critical (i.e. on my spool expire takes about 27h before). But they are designed to work together, and once effective (after first run) this will make expire a seconds to minutes runtime even for a very busy site. And with that kind of speedup, the option also opens up to run these more frequently and disable nnrpdcheckart.

@Julien-Elie
Copy link
Copy Markdown
Contributor

Isn't there any need from sometimes doing a full legacy run? I am wondering whether we won't have any orphan never expired even after a few runs either with the bloom filter or this new feature.
Manually removing a file from tradspool or other edge-cases.

@kev009
Copy link
Copy Markdown
Contributor Author

kev009 commented May 9, 2026

@Julien-Elie yes, the gap is manually 'rm' something or filesystem corruption, in which case a full expire run would be necessary. We need to make sure the documentation about that is clear, these are intended for healthy spools where inn tools are used to manage the spool. If that is invalidated, the admin needs to do a patrol expire.

@kev009 kev009 force-pushed the expire-tombstone branch 2 times, most recently from 9ce35a5 to a18db3d Compare May 9, 2026 18:29
freebsd-git pushed a commit to freebsd/freebsd-ports that referenced this pull request May 17, 2026
InterNetNews/inn#340

Gated by EXPERIMENTAL option

Move all patches to local to handle merge conflicts.
@kev009
Copy link
Copy Markdown
Contributor Author

kev009 commented May 18, 2026

Actual runtime on my system. It takes a little longer than I hypothesized because I forgot the actual data copy cost of writing out a new history file. But plenty fast:

expire begin Mon May 18 04:40:40 UTC 2026: (-v1)
    Loaded 10 + 2 + 0 tombstone entries (10 unique)
    Article lines processed 68900687
    Articles retained       68900677
    Entries expired               10
    Tombstone hits                10
expire end Mon May 18 04:44:07 UTC 2026

kev009 added 2 commits May 17, 2026 23:48
Adds an opt-in (expiretombstone in inn.conf, default false)
side-channel that records every cancellation alongside the existing
storage and overview deletions, so a subsequent expire run can drop
matching history entries without doing per-article
SMretrieve(RETR_STAT) syscalls.  At billion-article scale on
tradspool/timehash this turns a billion access(2) calls into a few
thousand hash lookups.

Two log files cooperate:

  ${pathdb}/expireover.tombstone -- written by expireover after each
  successful inline SMcancel; in delayrm mode written up front and
  promoted by expirerm after fastrm succeeds.  Atomic .NEW -> final
  rename under an exclusive non-blocking fcntl POSIX lock.  A
  crashed-or-failed previous run's leftover .NEW is verified
  per-token via SMretrieve and merged into the next run, so partial
  rmfile failures cannot orphan articles on disk.

  ${pathdb}/cancels.tombstone -- appended by innd's ARTcancel and by
  sm -r through a new public SMcanceltombstone() helper.  Shared
  fcntl POSIX lock for appenders, exclusive for expire's snapshot
  rename to .processing; one retry on EAGAIN closes the appender
  vs consumer race window.  POSIX O_APPEND atomicity for sub-
  PIPE_BUF writes keeps single-line tokens from interleaving.

expire reads both files into one hashset, treats every article in
either log as already gone, and falls back to per-article SMretrieve
only for storage methods where SMprobe(SELFEXPIRE) is true (CNFS
wrap-around).  An empty-but-present tombstone correctly says "no
cancels this cycle" and skips SMretrieve for everything; truly
absent files fall back to the slow path.

The format is line-per-token; readers ignore blank and #-prefixed
lines, and expireover writes "# inn-tombstone v1" as a forward-
compatibility header.  Disk and memory footprint scale linearly:
~38 bytes per entry on disk, ~50 bytes in expire's hash table.

Tests cover the public library (tombstone_hash_create, _read,
_rename_for_processing, _present), SMcanceltombstone branches
including expiretombstone=false bypass and unwritable-pathdb
non-crash, two-file merge with dedup, comment-line skip,
HISexpire integration with a callback that mirrors EXPdoline's
full decision tree (kept / tombstoned / selfexpire-gone /
selfexpire-alive), and an end-to-end shell test that exercises
sm + cancels.tombstone + expiretombstone toggling.

Default is false; sites should opt in after validating the
option's behavior against their workload.
When both innconf->expiretombstone and PERMaccessconf->nnrpdcheckart
are true, ARTinstorebytoken now consults ${pathdb}/cancels.tombstone
before calling SMretrieve(RETR_STAT).  A token recorded in the log is
reported gone without a syscall; a token absent from the log is
trusted to still exist on storage methods that do not self-expire,
again without a syscall.  Self-expiring backends (CNFS) still go
through SMretrieve because cyclic-buffer wrap-around bypasses the
tombstone.

The same trade-off as the expire-side fast path applies: the
tombstone catches in-band cancellations (innd ARTcancel, sm -r) but
not out-of-band events (manual filesystem deletes, corruption).
Admins running nnrpdcheckart=true to catch the brief overview-vs-
storage race during innd cancel processing, or to catch sm -r
artifacts before the next expireover cleans overview, still get those
paths.  Loss is bounded to events the tombstone subsystem cannot
see, which is consistent with the contract documented for
expiretombstone.

Loading is lazy: the hashset is built on first call into
ARTinstorebytoken and the file is stat()ed on every subsequent call
to detect change, with reload only when mtime or size changes.
Statting a fixed path stays in the dentry cache and is trivially
cheap compared to the per-article SMretrieve syscalls the fast path
elides, so cancellations recorded by other processes become visible
to long-lived connections immediately.  Freshness is keyed on both
mtime and size: mtime alone has 1-second granularity, so a rename-
and-recreate within the same second can land a fresh inode with
the cached mtime; size is monotonic-append between rotations and
catches every modification.  Partial reads (mid-file ferror) leave
the cached freshness key untouched so the next call retries.

The SMprobe(SELFEXPIRE) probe is checked before the tombstone
refresh so pure-CNFS sites pay only that probe (a static per-method
attribute, no I/O) and skip both the hash lookup and the cache load
entirely.

Memory cost is dominated by the hashset (~50 bytes per cancel;
typical sites have hundreds of entries).  When the file is missing
entirely, expiretombstone is off, or nnrpdcheckart is off, the
original SMretrieve path is unchanged.

inn.conf.pod's nnrpdcheckart entry now describes the fast-path
gating, the mtime/size-driven reload model, the manual-rm caveat
on non-self-expiring backends, and the OVERartcheck counter
becoming an undercount when the fast path is active.
@kev009 kev009 force-pushed the expire-tombstone branch from a18db3d to b40b281 Compare May 18, 2026 06:52
@Julien-Elie Julien-Elie self-assigned this May 20, 2026
@Julien-Elie Julien-Elie added this to the 2.7.4 milestone May 20, 2026
@Julien-Elie Julien-Elie added enhancement New feature or request C: history Related to history P: medium Medium priority labels May 20, 2026
Julien-Elie pushed a commit that referenced this pull request May 24, 2026
Adds an opt-in (expiretombstone in inn.conf, default false)
side-channel that records every cancellation alongside the existing
storage and overview deletions, so a subsequent expire run can drop
matching history entries without doing per-article
SMretrieve(RETR_STAT) syscalls.  At billion-article scale on
tradspool/timehash this turns a billion access(2) calls into a few
thousand hash lookups.

Two log files cooperate:

  ${pathdb}/expireover.tombstone -- written by expireover after each
  successful inline SMcancel; in delayrm mode written up front and
  promoted by expirerm after fastrm succeeds.  Atomic .NEW -> final
  rename under an exclusive non-blocking fcntl POSIX lock.  A
  crashed-or-failed previous run's leftover .NEW is verified
  per-token via SMretrieve and merged into the next run, so partial
  rmfile failures cannot orphan articles on disk.

  ${pathdb}/cancels.tombstone -- appended by innd's ARTcancel and by
  sm -r through a new public SMcanceltombstone() helper.  Shared
  fcntl POSIX lock for appenders, exclusive for expire's snapshot
  rename to .processing; one retry on EAGAIN closes the appender
  vs consumer race window.  POSIX O_APPEND atomicity for sub-
  PIPE_BUF writes keeps single-line tokens from interleaving.

expire reads both files into one hashset, treats every article in
either log as already gone, and falls back to per-article SMretrieve
only for storage methods where SMprobe(SELFEXPIRE) is true (CNFS
wrap-around).  An empty-but-present tombstone correctly says "no
cancels this cycle" and skips SMretrieve for everything; truly absent
files fall back to the slow path.

The format is line-per-token; readers ignore blank and #-prefixed
lines, and expireover writes "# inn-tombstone v1" as a forward-
compatibility header.  Disk and memory footprint scale linearly:
~38 bytes per entry on disk, ~50 bytes in expire's hash table.

Tests cover the public library (tombstone_hash_create, _read,
_rename_for_processing, _present), SMcanceltombstone branches
including expiretombstone=false bypass and unwritable-pathdb
non-crash, two-file merge with dedup, comment-line skip,
HISexpire integration with a callback that mirrors EXPdoline's
full decision tree (kept / tombstoned / selfexpire-gone /
selfexpire-alive), and an end-to-end shell test that exercises
sm + cancels.tombstone + expiretombstone toggling.

Default is false; sites should opt in after validating the
option's behavior against their workload.

see #340
Julien-Elie pushed a commit that referenced this pull request May 24, 2026
When both innconf->expiretombstone and PERMaccessconf->nnrpdcheckart
are true, ARTinstorebytoken now consults ${pathdb}/cancels.tombstone
before calling SMretrieve(RETR_STAT).  A token recorded in the log is
reported gone without a syscall; a token absent from the log is
trusted to still exist on storage methods that do not self-expire,
again without a syscall.  Self-expiring backends (CNFS) still go
through SMretrieve because cyclic-buffer wrap-around bypasses the
tombstone.

The same trade-off as the expire-side fast path applies: the
tombstone catches in-band cancellations (innd ARTcancel, sm -r) but
not out-of-band events (manual filesystem deletes, corruption).
Admins running nnrpdcheckart=true to catch the brief overview-vs-
storage race during innd cancel processing, or to catch sm -r
artifacts before the next expireover cleans overview, still get those
paths.  Loss is bounded to events the tombstone subsystem cannot
see, which is consistent with the contract documented for
expiretombstone.

Loading is lazy: the hashset is built on first call into
ARTinstorebytoken and the file is stat()ed on every subsequent call
to detect change, with reload only when mtime or size changes.
Statting a fixed path stays in the dentry cache and is trivially
cheap compared to the per-article SMretrieve syscalls the fast path
elides, so cancellations recorded by other processes become visible
to long-lived connections immediately.  Freshness is keyed on both
mtime and size: mtime alone has 1-second granularity, so a rename-
and-recreate within the same second can land a fresh inode with
the cached mtime; size is monotonic-append between rotations and
catches every modification.  Partial reads (mid-file ferror) leave
the cached freshness key untouched so the next call retries.

The SMprobe(SELFEXPIRE) probe is checked before the tombstone
refresh so pure-CNFS sites pay only that probe (a static per-method
attribute, no I/O) and skip both the hash lookup and the cache load
entirely.

Memory cost is dominated by the hashset (~50 bytes per cancel;
typical sites have hundreds of entries).  When the file is missing
entirely, expiretombstone is off, or nnrpdcheckart is off, the
original SMretrieve path is unchanged.

inn.conf.pod's nnrpdcheckart entry now describes the fast-path
gating, the mtime/size-driven reload model, the manual-rm caveat
on non-self-expiring backends, and the OVERartcheck counter
becoming an undercount when the fast path is active.

see #340
Julien-Elie added a commit that referenced this pull request May 24, 2026
Julien-Elie added a commit that referenced this pull request May 24, 2026
@Julien-Elie
Copy link
Copy Markdown
Contributor

A very clean and thoughtful patch. Thanks again Kevin for your work!

I'm wondering whether the 2 expiretombstone-without-groupbaseexpiry checks wouldn't be better in the innconf_validate() function in lib/innconf.c so that it is directly noticed by the news admin. I can move it there if you also think the consistency should be done there.
Anyway, thanks for having taking care of bypassing the SMcanceltombstone function if not correctly configured.

As for the use of sm -r, if I understand well the new usage to have a consistent ownership for cancels.tombstone, shouldn't we say in the sm documentation that when expiretombstone is enabled, the program needs being run as the news user and group?

@Julien-Elie
Copy link
Copy Markdown
Contributor

Julien-Elie commented May 24, 2026

Like for the improvements to expireover, I'll be glad to hear from other news admins who run large news spools about how expiretombstone performs.
I do not have enough articles to obtain a benefit for my usage (only a couple of seconds), but hopefully it helps others!
As soon as the issue about tagged hash history is fixed, we'll have a snapshot generated for CURRENT that I'll advertise for wider testings before merging them into STABLE for the INN 2.7.4 release.

Without expiretombstone:

% time expire -N -v1
Article lines processed  4108518
Articles retained        4108518
Entries expired                0
expire -N -v1  7,16s user 2,94s system 99% cpu 10,145 total

With expiretombstone:

% time expire -N -v1
Loaded 0 + 0 + 0 tombstone entries (0 unique)
Article lines processed  4108519
Articles retained        4108519
Entries expired                0
Tombstone hits                 0
expire -N -v1  7,10s user 1,51s system 99% cpu 8,656 total

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C: history Related to history enhancement New feature or request P: medium Medium priority

Development

Successfully merging this pull request may close these issues.

2 participants