Closed
2 changes: 2 additions & 0 deletions .gitignore
@@ -195,6 +195,8 @@
/tests/innd/chan.t
/tests/lib/artnumber.t
/tests/lib/asprintf.t
/tests/lib/bloom.t
/tests/lib/bloom-hiswalk.t
/tests/lib/buffer.t
/tests/lib/canlock.t
/tests/lib/concat.t
4 changes: 4 additions & 0 deletions MANIFEST
@@ -412,6 +412,7 @@ include/Makefile Makefile for header files
include/conffile.h Header file for reading *.conf files
include/config.h.in Template configuration data
include/inn Installed header files (Directory)
include/inn/bloom.h Header file for bloom filter
include/inn/buffer.h Header file for reusable counted buffers
include/inn/concat.h Header file for string concatenation
include/inn/confparse.h Header file for configuration parser
@@ -518,6 +519,7 @@ lib/Makefile Makefile for library
lib/argparse.c Functions for parsing arguments
lib/artnumber.c Manipulation of article numbers
lib/asprintf.c asprintf replacement
lib/bloom.c Bloom filter implementation
lib/buffer.c Reusable counted buffer
lib/canlock.c Routines for Cancel-Lock
lib/cleanfrom.c Clean out a From line
@@ -938,6 +940,8 @@ tests/innd/fakeinnd.c Provide symbols defined by innd/innd.c
tests/lib Test suite for libinn (Directory)
tests/lib/artnumber-t.c Tests for lib/artnumber.c
tests/lib/asprintf-t.c Tests for lib/asprintf.c
tests/lib/bloom-hiswalk-t.c Integration test for bloom filter with HISwalk
tests/lib/bloom-t.c Tests for lib/bloom.c
tests/lib/buffer-t.c Tests for lib/buffer.c
tests/lib/canlock-t.c Tests for lib/canlock.c
tests/lib/concat-t.c Tests for lib/concat.c
16 changes: 16 additions & 0 deletions doc/pod/expireover.pod
@@ -39,6 +39,22 @@ By default, B<expireover> purges all overview information for newsgroups
that have been removed from the server; this behavior is suppressed if
B<-f> is given.

To speed up the existence check for each article, B<expireover> builds
a bloom filter from the history file at startup. This replaces
per-article random I/O into the history file with a single sequential
read, which is critical for large spools. The bloom filter is a
positive-only cache: if it reports an article probably exists, the
slow history lookup is skipped. If it reports the article is not found,
B<expireover> falls back to a direct history lookup for correctness.
False positives (the bloom filter incorrectly reporting an article
exists) are benign; the orphaned overview entry will be cleaned up on
the next expiration run.
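
The positive-only cache behavior described above can be sketched as follows. This is a minimal illustration, not the code used by the overview layer: `toy_filter`, `toy_check`, `slow_history_lookup`, and `article_exists` are hypothetical names, and the "filter" is an exact table over small integer keys rather than a real bloom filter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_KEYS 16

/* Hypothetical stand-in for the filter: one flag per small integer key. */
struct toy_filter {
    bool present[TOY_KEYS];
};

/* Counts how often the slow path is taken, for illustration only. */
static int slow_lookups = 0;

/* Hypothetical slow path standing in for a per-article history lookup. */
static bool
slow_history_lookup(const bool *history, unsigned int key)
{
    assert(key < TOY_KEYS);
    slow_lookups++;
    return history[key];
}

static bool
toy_check(const struct toy_filter *f, unsigned int key)
{
    assert(key < TOY_KEYS);
    return f->present[key];
}

/* Positive-only cache: a hit is trusted, a miss falls back to the slow
   path, so entries added after the filter was built are still found. */
static bool
article_exists(const struct toy_filter *f, const bool *history,
               unsigned int key)
{
    if (f != NULL && toy_check(f, key))
        return true;
    return slow_history_lookup(history, key);
}
```

With key 3 in both the filter and the history, and key 7 only in the history (as if it arrived after the filter was built), lookups for 3, 7, and 9 return true, true, and false, and only the last two reach the slow path.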

The false positive rate and memory usage of the bloom filter are
controlled by the I<expirebloomfp> setting in F<inn.conf>. Setting it
to C<0> disables the bloom filter. The bloom filter is also disabled
when the B<-s> flag is used.

=head1 OPTIONS

=over 4
20 changes: 20 additions & 0 deletions doc/pod/inn.conf.pod
@@ -526,6 +526,26 @@ will run much faster, but reading news from the system will be impossible
true, I<ovmethod> must also be set. This is a boolean value and the
default is true.

=item I<expirebloomfp>

Controls the bloom filter used by B<expireover> to accelerate overview
expiration. The value is the reciprocal of the desired false positive
rate: for example, C<10000> means a 1-in-10,000 (0.01%) false positive
rate. Higher values use more memory but produce fewer false positives.
At the default of C<10000>, memory usage is approximately 20 bits per
article in the history file (e.g., 48S<MB> for 20 million articles,
2.4S<GB> for 1 billion articles).

Setting this to C<0> disables the bloom filter entirely, falling back
to per-article history lookups (the pre-existing behavior). This is
not recommended for large spools as it results in random I/O into the
history file for every article in the overview database.

The bloom filter has no effect when B<expireover> is run with the B<-s>
flag (which forces a filesystem stat of every article).

This is a non-negative integer and the default is C<10000>.
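
The memory figures quoted above are consistent with the textbook sizing of a bloom filter. The worked calculation below assumes the filter uses the optimal number of bits $m$ for $n$ entries at a target false positive rate $p$; the sizing code in lib/bloom.c is not shown in this part of the diff, so treat this as a cross-check rather than a description of the implementation.

```latex
\begin{align*}
\frac{m}{n} &= \frac{-\ln p}{(\ln 2)^2}, \qquad p = \frac{1}{10000} \\
            &= \frac{\ln 10000}{(\ln 2)^2}
             \approx \frac{9.2103}{0.4805}
             \approx 19.17 \text{ bits per entry} \\
k &= \frac{m}{n}\,\ln 2 \approx 13.3 \text{ hash functions} \\
n = 2 \times 10^{7}: \quad m &\approx 3.83 \times 10^{8} \text{ bits}
             \approx 47.9 \text{ MB} \\
n = 10^{9}: \quad m &\approx 1.917 \times 10^{10} \text{ bits}
             \approx 2.40 \text{ GB}
\end{align*}
```

Under that assumption, "approximately 20 bits per article" is a rounding of about 19.2 bits, and the 48 MB and 2.4 GB examples follow directly.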

=item I<extraoverviewadvertised>

Besides the seven standard overview fields (which are in order C<Subject>,
15 changes: 9 additions & 6 deletions doc/pod/libinnhist.pod
@@ -67,8 +67,9 @@ his - routines for managing INN history

bool HISwalk(struct history *history, const char *reason,
void *cookie,
- bool (*callback)(void *cookie, time_t arrived,
- time_t posted, time_t expires,
+ bool (*callback)(void *cookie, const HASH *hash,
+ time_t arrived, time_t posted,
+ time_t expires,
const TOKEN *token));

struct histstats HISstats(struct history *history);
@@ -210,10 +211,12 @@ unspecified.

B<HISwalk> provides an iteration function for the specified I<history>
database. For every entry in the history database, I<callback> is
- invoked, passing the I<cookie>, arrival, posting, and expiry times, in
- addition to the token associated with the entry. If the I<callback>()
- returns B<false> the iteration is aborted and B<HISwalk> returns
- B<false> to the caller.
+ invoked, passing the I<cookie>, the message-ID I<hash>, arrival,
+ posting, and expiry times, in addition to the token associated with
+ the entry. If the entry has no storage token (a remembered
+ message-ID), I<token> is B<NULL>. If the I<callback>() returns
+ B<false> the iteration is aborted and B<HISwalk> returns B<false> to
+ the caller. Malformed history lines are silently skipped.
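
A callback written against the new signature might look like the sketch below. It is a standalone illustration under stated assumptions: the `HASH` and `TOKEN` typedefs are stand-ins so the example compiles without INN headers, and `demo_entry`, `walk_counts`, `count_cb`, and `demo_walk` are hypothetical names; `demo_walk` only imitates the callback contract of B<HISwalk> over an in-memory array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Stand-ins for INN's HASH and TOKEN types; layouts are illustrative. */
typedef struct { char hash[16]; } HASH;
typedef struct { char data[18]; } TOKEN;

struct demo_entry {
    HASH hash;
    time_t arrived, posted, expires;
    const TOKEN *token; /* NULL for a remembered message-ID */
};

struct walk_counts {
    size_t stored;
    size_t remembered;
};

/* Callback with the new signature: the hash follows the cookie. */
static bool
count_cb(void *cookie, const HASH *hash, time_t arrived, time_t posted,
         time_t expires, const TOKEN *token)
{
    struct walk_counts *counts = cookie;

    (void) hash;
    (void) arrived;
    (void) posted;
    (void) expires;
    if (token != NULL)
        counts->stored++;
    else
        counts->remembered++;
    return true; /* returning false would abort the walk */
}

/* Hypothetical driver that imitates the HISwalk callback contract. */
static bool
demo_walk(const struct demo_entry *entries, size_t n, void *cookie,
          bool (*callback)(void *, const HASH *, time_t, time_t, time_t,
                           const TOKEN *))
{
    size_t i;

    assert(callback != NULL);
    for (i = 0; i < n; i++)
        if (!callback(cookie, &entries[i].hash, entries[i].arrived,
                      entries[i].posted, entries[i].expires,
                      entries[i].token))
            return false;
    return true;
}
```

Existing callers need to add the `const HASH *` parameter after the cookie and must be prepared for a NULL token, as the counting callback above is.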

To process the entire database in the presence of a running server,
I<reason> may be passed; if this argument is not B<NULL>, it is used
61 changes: 59 additions & 2 deletions expire/expireover.c
@@ -14,6 +14,10 @@
#include <syslog.h>
#include <time.h>

#include <sys/stat.h>

#include "inn/bloom.h"
#include "inn/history.h"
#include "inn/innconf.h"
#include "inn/libinn.h"
#include "inn/messages.h"
@@ -45,6 +49,22 @@ fatal_signal(int sig)
}


/*
** Callback for HISwalk that adds history entries with storage tokens to the
** bloom filter. Entries without tokens (remembered message-IDs) are skipped
** so that OVhisthasmsgid correctly identifies them as missing.
*/
static bool
build_bloom_cb(void *cookie, const HASH *hash,
time_t arrived UNUSED, time_t posted UNUSED,
time_t expires UNUSED, const TOKEN *token)
{
if (token != NULL)
bloom_add(cookie, hash);
return true;
}


int
main(int argc, char *argv[])
{
@@ -60,6 +80,8 @@ main(int argc, char *argv[])
bool purge_deleted = false;
bool always_stat = false;
struct history *history;
struct bloom_filter *bloom = NULL;
struct bloom_filter *null_bloom = NULL;

/* First thing, set up logging and our identity. */
openlog("expireover", L_OPENLOG_FLAGS | LOG_PID, LOG_INN_PROG);
@@ -181,12 +203,43 @@ main(int argc, char *argv[])
if (!OVctl(OVSTATALL, &always_stat))
die("can't configure overview stat behavior");

- /* We want to be careful about being interrupted from this point on, so
- set up our signal handlers. */
+ /* Set up signal handlers before the bloom walk, which can take several
+ minutes on very large history files. */
xsignal(SIGTERM, fatal_signal);
xsignal(SIGINT, fatal_signal);
xsignal(SIGHUP, fatal_signal);

/* Build a bloom filter from the history file for fast existence checks.
This replaces millions of random pread() calls into the history file
with a single sequential read, making expireover feasible on large
spools (1B+ articles). The bloom filter is used as a positive-only
cache: hits skip the slow history lookup, misses fall through to
HISlookup for correctness (handles articles added after the walk). */
if (innconf->expirebloomfp > 0 && !always_stat) {
Contributor Author

@Julien-Elie I wonder if we should add another chicken bit here and only do it for tradspool, timehash, etc.?

Contributor Author
@kev009 kev009 May 23, 2026

But the times with the default fp look reasonable, so we may need more data points to see whether, e.g., CNFS should completely opt out.

Contributor

The times are similar, so maybe it's not worth adding extra complexity at this point to probe whether articles are in a self-expiring storage method and the -N flag is not used.

FWIW, without the -N flag, the run time of expireover is still similar on my news spool (about 46 seconds for 416 000 articles).

struct stat st;
char *histpath;
size_t estimated = 0;
/* Minimum history line: 34 (hash) + 1 (tab) + 1 (arrived)
* + 1 (newline). Dividing file size by this gives a conservative
* overestimate of entries, which is what we want for bloom sizing. */
const size_t min_history_line = 37;

histpath = concatpath(innconf->pathdb, INN_PATH_HISTORY);
if (stat(histpath, &st) == 0)
estimated = st.st_size / min_history_line;
else
warn("can't stat %s, bloom filter will be undersized", histpath);
bloom = bloom_create(estimated, innconf->expirebloomfp);
if (!HISwalk(history, NULL, bloom, build_bloom_cb)) {
warn("can't walk history for bloom filter, using per-article"
" lookups");
bloom_free(bloom);
bloom = NULL;
}
OVctl(OVTOKENCACHE, &bloom);
free(histpath);
}

/* Loop through each line of the input file and process each group,
writing data to the lowmark file if desired. */
line = QIOread(qp);
@@ -212,6 +265,10 @@ main(int argc, char *argv[])
warn("can't expire deleted newsgroups");

/* Close everything down in an orderly fashion. */
if (bloom) {
OVctl(OVTOKENCACHE, &null_bloom);
bloom_free(bloom);
}
QIOclose(qp);
OVclose();
SMshutdown();
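
The sizing step in the hunk above divides the history file size by the shortest possible line length. Because every real line is at least that long, the quotient can only overestimate the number of entries, which errs on the side of a larger filter. A minimal sketch of that arithmetic, assuming a hypothetical helper named `estimate_entries` (the diff performs the division inline):

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* 34 (hash) + 1 (tab) + 1 (arrived) + 1 (newline), per the diff's comment. */
#define MIN_HISTORY_LINE 37

/* Hypothetical helper: upper bound on the entries in a history file of
   the given size.  Assumes the result fits in size_t. */
static size_t
estimate_entries(off_t file_size)
{
    assert(file_size >= 0);
    return (size_t) (file_size / MIN_HISTORY_LINE);
}
```

For example, a 3,700-byte history file yields an estimate of 100 entries, even though a file of that size with realistic line lengths would hold fewer.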
3 changes: 2 additions & 1 deletion history/his.c
@@ -328,7 +328,8 @@ HISreplace(struct history *h, const char *key, time_t arrived, time_t posted,

bool
HISwalk(struct history *h, const char *reason, void *cookie,
- bool (*callback)(void *, time_t, time_t, time_t, const TOKEN *))
+ bool (*callback)(void *, const HASH *, time_t, time_t, time_t,
+ const TOKEN *))
{
bool r;

3 changes: 2 additions & 1 deletion history/hisinterface.h
@@ -6,6 +6,7 @@
#define HISINTERFACE_H

#include "config.h"
#include "inn/libinn.h"
#include <sys/types.h>

struct token;
@@ -27,7 +28,7 @@ typedef struct hismethod {
bool (*expire)(void *, const char *, const char *, bool, void *, time_t,
bool (*)(void *, time_t, time_t, time_t, struct token *));
bool (*walk)(void *, const char *, void *,
- bool (*)(void *, time_t, time_t, time_t,
+ bool (*)(void *, const HASH *, time_t, time_t, time_t,
const struct token *));
bool (*remember)(void *, const char *, time_t, time_t);
bool (*ctl)(void *, int, void *);
3 changes: 2 additions & 1 deletion history/hisv6/hisv6-private.h
@@ -64,7 +64,8 @@ struct hisv6 {
struct hisv6_walkstate {
union {
bool (*expire)(void *, time_t, time_t, time_t, TOKEN *);
- bool (*walk)(void *, time_t, time_t, time_t, const TOKEN *);
+ bool (*walk)(void *, const HASH *, time_t, time_t, time_t,
+ const TOKEN *);
} cb;
void *cookie;
bool paused;
15 changes: 10 additions & 5 deletions history/hisv6/hisv6.c
@@ -1071,14 +1071,14 @@ hisv6_traverse(struct hisv6 *h, struct hisv6_walkstate *cookie,
** parameters the user callback expects
**/
static bool
- hisv6_traversecb(struct hisv6 *h UNUSED, void *cookie, const HASH *hash UNUSED,
+ hisv6_traversecb(struct hisv6 *h UNUSED, void *cookie, const HASH *hash,
time_t arrived, time_t posted, time_t expires,
const TOKEN *token)
{
struct hisv6_walkstate *hiscookie = cookie;

- return (*hiscookie->cb.walk)(hiscookie->cookie, arrived, posted, expires,
- token);
+ return (*hiscookie->cb.walk)(hiscookie->cookie, hash, arrived, posted,
+ expires, token);
}


@@ -1087,7 +1087,8 @@ hisv6_traversecb(struct hisv6 *h UNUSED, void *cookie, const HASH *hash UNUSED,
*/
bool
hisv6_walk(void *history, const char *reason, void *cookie,
- bool (*callback)(void *, time_t, time_t, time_t, const TOKEN *))
+ bool (*callback)(void *, const HASH *, time_t, time_t, time_t,
+ const TOKEN *))
{
struct hisv6 *h = history;
struct hisv6_walkstate hiscookie;
@@ -1099,7 +1100,11 @@ hisv6_walk(void *history, const char *reason, void *cookie,
hiscookie.cookie = cookie;
hiscookie.new = NULL;
hiscookie.paused = false;
- hiscookie.ignore = false;
+ /* Ignore malformed history lines during walk. The walk is a read-only
+ operation (e.g., building a bloom filter for expireover); aborting
+ over one corrupt line in a potentially 180 GB file would be
+ catastrophic. expire is what fixes corrupt history entries. */
+ hiscookie.ignore = true;

r = hisv6_traverse(h, &hiscookie, reason, hisv6_traversecb);

4 changes: 3 additions & 1 deletion history/hisv6/hisv6.h
@@ -5,6 +5,8 @@
#ifndef HISV6_H
#define HISV6_H

#include "inn/libinn.h"

struct token;
struct histopts;
struct history;
@@ -32,7 +34,7 @@ bool hisv6_expire(void *, const char *, const char *, bool, void *,
struct token *));

bool hisv6_walk(void *, const char *, void *,
- bool (*)(void *, time_t, time_t, time_t,
+ bool (*)(void *, const HASH *, time_t, time_t, time_t,
const struct token *));

const char *hisv6_error(void *);
69 changes: 69 additions & 0 deletions include/inn/bloom.h
@@ -0,0 +1,69 @@
/*
** Bloom filter for fast set membership testing.
**
** A space-efficient probabilistic data structure that can test whether
** an element is a member of a set. False positive matches are possible,
** but false negatives are not: a query returns either "possibly in set"
** or "definitely not in set."
**
** Uses enhanced double hashing (Kirsch & Mitzenmacher 2006) to derive
** multiple hash positions from a single HASH value.
*/

#ifndef INN_BLOOM_H
#define INN_BLOOM_H

#include "inn/libinn.h"
#include "inn/portable-macros.h"
#include "inn/portable-stdbool.h"

#include <stddef.h>

BEGIN_DECLS

/* The layout of this struct is entirely internal to the implementation. */
struct bloom_filter;

/*
** Create a new bloom filter sized for the given number of estimated entries
** and false positive rate expressed as a reciprocal (e.g., 10000 means
** 1-in-10,000 or 0.01% false positive rate). Uses xmalloc internally,
** so dies on allocation failure.
*/
struct bloom_filter *bloom_create(size_t estimated_entries, unsigned long fp_inv);

/*
** Add a HASH to the bloom filter.
*/
void bloom_add(struct bloom_filter *bf, const HASH *hash);

/*
** Check whether a HASH is possibly in the bloom filter. Returns true if
** the element is probably in the set (with false positive rate as configured),
** or false if the element is definitely not in the set.
*/
bool bloom_check(const struct bloom_filter *bf, const HASH *hash);

/*
** Free a bloom filter and all associated memory. Safe to call with NULL.
*/
void bloom_free(struct bloom_filter *bf);

/*
** Return the number of entries that have been added to the bloom filter.
*/
size_t bloom_count(const struct bloom_filter *bf);

/*
** Return the number of hash functions (k) used by the bloom filter.
*/
unsigned int bloom_nhash(const struct bloom_filter *bf);

/*
** Return the total number of bits (m) in the bloom filter.
*/
size_t bloom_bits(const struct bloom_filter *bf);

END_DECLS

#endif /* INN_BLOOM_H */
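
lib/bloom.c is not shown in this part of the diff. To make the interface above concrete, here is a minimal sketch of a bloom filter that derives its probe positions by enhanced double hashing. It is an illustration under stated assumptions, not the PR's implementation: all `demo_bloom_*` names are hypothetical, `HASH` is a 16-byte stand-in, the caller passes the bit count and probe count directly instead of a false positive reciprocal, and allocation failure returns NULL rather than dying as an xmalloc-based version would.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for INN's HASH (assumed here to be 16 opaque bytes). */
typedef struct { char hash[16]; } HASH;

struct demo_bloom {
    uint64_t nbits;      /* m: number of bits in the filter */
    unsigned int nhash;  /* k: probes per element */
    size_t count;        /* number of elements added */
    unsigned char *bits; /* bit array, nbits rounded up to whole bytes */
};

/* Unlike bloom_create in the header, the caller picks m and k directly. */
static struct demo_bloom *
demo_bloom_create(uint64_t nbits, unsigned int nhash)
{
    struct demo_bloom *bf;

    assert(nbits > 0 && nbits <= UINT64_MAX / 2 && nhash > 0);
    bf = malloc(sizeof(*bf));
    if (bf == NULL)
        return NULL;
    bf->bits = calloc((size_t) ((nbits + 7) / 8), 1);
    if (bf->bits == NULL) {
        free(bf);
        return NULL;
    }
    bf->nbits = nbits;
    bf->nhash = nhash;
    bf->count = 0;
    return bf;
}

/* Visit the k probe positions: set the bits when adding, test them when
   checking.  Position i is a + i*b + (i^3 - i)/6 mod m, computed
   incrementally, where a and b are the two halves of the hash. */
static bool
demo_bloom_probe(struct demo_bloom *bf, const HASH *hash, bool set)
{
    uint64_t x, y, i;

    memcpy(&x, hash->hash, sizeof(x));
    memcpy(&y, hash->hash + sizeof(x), sizeof(y));
    x %= bf->nbits;
    y %= bf->nbits;
    for (i = 0; i < bf->nhash; i++) {
        if (set)
            bf->bits[x / 8] |= (unsigned char) (1u << (x % 8));
        else if (!(bf->bits[x / 8] & (1u << (x % 8))))
            return false; /* a clear bit means "definitely not in set" */
        x = (x + y) % bf->nbits;
        y = (y + i + 1) % bf->nbits;
    }
    return true;
}

static void
demo_bloom_add(struct demo_bloom *bf, const HASH *hash)
{
    demo_bloom_probe(bf, hash, true);
    bf->count++;
}

static bool
demo_bloom_check(struct demo_bloom *bf, const HASH *hash)
{
    return demo_bloom_probe(bf, hash, false);
}

/* Safe to call with NULL, matching the contract documented for bloom_free. */
static void
demo_bloom_free(struct demo_bloom *bf)
{
    if (bf != NULL) {
        free(bf->bits);
        free(bf);
    }
}
```

A check on an element that was added always returns true, since every bit it probes was set by the add; a check on any other element returns false unless all of its probe positions happen to collide with bits set by other elements, which is the false positive case.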
4 changes: 3 additions & 1 deletion include/inn/history.h
@@ -5,6 +5,7 @@
#ifndef INN_HISTORY_H
#define INN_HISTORY_H

#include "inn/libinn.h"
#include "inn/macros.h"
#include "inn/portable-stdbool.h"
#include <sys/types.h>
@@ -96,7 +97,8 @@ bool HISexpire(struct history *, const char *, const char *, bool, void *,
time_t,
bool (*)(void *, time_t, time_t, time_t, struct token *));
bool HISwalk(struct history *, const char *, void *,
- bool (*)(void *, time_t, time_t, time_t, const struct token *));
+ bool (*)(void *, const HASH *, time_t, time_t, time_t,
+ const struct token *));
struct histstats HISstats(struct history *);
const char *HISerror(struct history *);
bool HISctl(struct history *, int, void *);