
Threaded depmod, take 2 #425

Draft
evelikov wants to merge 4 commits into kmod-project:master from evelikov:threaded-depmod


@evelikov
Collaborator

This is a retake of #412, the slimmed down version in particular.

The changes:

  • split out report error on array_append failure into separate patch

  • reuse struct symbol, instead of rolling our own

  • split/reuse symbol creation - 2 times fewer allocations, honour sym_prefix

  • don't allocate the struct symbol(s) twice

  • consolidate cleanup into a helper function

  • remove unused members in load_modules_work, rename

  • swap modules, start and end for modules(_start) and count

  • store thread id and "work" struct in a single place - nproc fewer calloc calls

  • properly check for errors on pthread_create/join

In summary, the numbers are as follows:

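| master | MR, nproc 1 | MR (nproc 8) | CPUs used | compression |
|--------|-------------|--------------|-----------|-------------|
| 0.44 s | 0.44 s | 0.73 s | 1.5 | none |
| 4.45 s | 4.45 s | 1.40 s | 4.3 | xz |
| 1.05 s | 1.04 s | 0.62 s | 2.9 | zst |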

These are mostly identical to the previous results, with a notable exception for uncompressed modules: there, unsurprisingly, adding extra threads hurts performance.

Some nproc numbers from an aging 4 core + HT laptop.

| nproc | none (s) | zst (s) |
|-------|----------|---------|
| 1 | 0.44 | 1.04 |
| 2 | 0.39 | 0.73 |
| 4 | 0.54 | 0.63 |
| 8 | 0.73 | 0.62 |

Some `perf stat -r 100` numbers

Even with 100 repeats, the numbers are not that stable/reliable.

| compression | stat | master | MR nproc 1 | slim | MR |
|-------------|------|--------|------------|------|----|
| uncompressed | task-clock | 435,749,999 | 435,047,972 | 1,082,318,930 | 1,087,043,546 |
| uncompressed | instructions | 1,213,446,151 | 1,202,426,481 | 1,193,567,961 | 1,192,806,832 |
| uncompressed | time elapsed | 0.43846 +- 0.00177 | 0.43802 +- 0.00184 | 0.72881 +- 0.00417 | 0.73283 +- 0.00318 |
| xz | task-clock | 4,450,100,429 | 4,444,117,416 | 6,063,143,589 | 6,057,782,176 |
| xz | instructions | 27,467,626,437 | 27,439,935,695 | 27,414,581,685 | 27,409,414,470 |
| xz | time elapsed | 4.45702 +- 0.00579 | 4.45167 +- 0.00663 | 1.40124 +- 0.00259 | 1.40324 +- 0.00255 |
| zst | task-clock | 1,047,500,349 | 1,038,654,591 | 1,836,694,522 | 1,810,041,398 |
| zst | instructions | 6,941,232,755 | 6,976,108,698 | 6,904,710,582 | 6,904,091,683 |
| zst | time elapsed | 1.05134 +- 0.00459 | 1.04285 +- 0.00338 | 0.63994 +- 0.00215 | 0.62171 +- 0.00400 |

Valgrind memory stats
| allocations | bytes | compression | build |
|-------------|-------|-------------|-------|
| 1,295,300 | 175,206,192 | uncompressed | master |
| 1,295,407 | 180,905,088 | uncompressed | MR nproc 1 |
| 1,322,139 | 176,862,003 | uncompressed | slim |
| 1,295,462 | 176,150,040 | uncompressed | MR |
| 1,404,387 | 53,370,466,853 | xz | master |
| 1,404,494 | 53,376,165,749 | xz | MR nproc 1 |
| 1,431,226 | 53,372,122,664 | xz | slim |
| 1,404,549 | 53,371,410,701 | xz | MR |
| 1,306,364 | 1,169,580,231 | zst | master |
| 1,306,471 | 1,175,279,127 | zst | MR nproc 1 |
| 1,333,203 | 1,171,236,042 | zst | slim |
| 1,306,526 | 1,170,524,079 | zst | MR |

Summary, relative to master

  • MR, nproc 1 - extra ~100 allocs, ~5.6M bytes in total
  • Slim - extra ~27k allocs, ~1.5M bytes in total
  • MR - extra ~160 allocs, ~1M bytes in total

The 27k -> 160 change is expected since we reuse the struct symbols, although why we get an extra ~5MB of memory with nproc 1 is beyond me 🤷

evelikov and others added 4 commits February 13, 2026 18:26
Currently, in the (unlikely) case of array_append failing, we'll return
success and thus partial output.

Fixes: 823849a ("tools/depmod: use separate arrays for alias,xxxdep values")
Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
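
To illustrate the kind of fix this first commit describes, here is a minimal sketch of propagating an append failure instead of returning success with partial output. The `toy_array` type and both function names are hypothetical stand-ins written for this example; they are not kmod's actual array helper or the code in the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an append-only array; not kmod's real API. */
struct toy_array {
	const void **items;
	size_t count;
	size_t capacity; /* fixed here so that a failure is easy to trigger */
};

/* Returns the index of the new element, or -ENOMEM when the array is full. */
static int toy_array_append(struct toy_array *a, const void *item)
{
	if (a->count >= a->capacity)
		return -ENOMEM;
	a->items[a->count] = item;
	return (int)a->count++;
}

/* Appends n values; stops at the first failure and reports it, rather
 * than silently returning success with partial output. */
static int collect_values(struct toy_array *a, const char *const *values, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int err = toy_array_append(a, values[i]);

		if (err < 0)
			return err;
	}
	return 0;
}
```

The caller sees a negative errno value as soon as one append fails, so it can abort instead of writing incomplete data.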
With a later commit, we'll need to create symbols outside of the
depmod_symbol_add function, so split the function in two.

Aside: depmod_symbol_add should really get some error checking, that
we're not helping with here... Also we should consider using
hash_add_unique() since currently a duplicate symbol will override the
previous one.

But all of that can happen at a later date.

Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
Split out the module symbol resolution - effectively, the initial
decompression of the module and elf parsing - from adding the data into
the depmod hashmap.

This will come in handy with the next commit, which will delegate the
symbol resolution (and thus the slowest part, decompression) to threads.

Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
Add pthread-based parallelization to depmod to improve performance on
multi-core systems. The changes parallelize:
- Module symbol loading

Co-authored-by: jared mauch <jared@puck.nether.net>
Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
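
The overall shape described by the last two commits (an expensive, independent phase run in threads, followed by a serial merge on the main thread) can be sketched roughly as below. This is a toy under stated assumptions: it sums integers instead of decompressing modules, and `toy_thread_info`, `toy_resolve` and `toy_parallel_sum` are hypothetical names, not the PR's code.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_THREADS 8

/* Hypothetical per-thread state: thread id and work data kept together. */
struct toy_thread_info {
	pthread_t tid;
	const int *items; /* this thread's slice of the input */
	size_t count;
	long partial;     /* written only by the owning thread */
};

/* Parallel phase: touches only the thread's own slice and result. */
static void *toy_resolve(void *arg)
{
	struct toy_thread_info *ti = arg;

	for (size_t i = 0; i < ti->count; i++)
		ti->partial += ti->items[i];
	return NULL;
}

/* Splits items across n_threads, then merges the results serially.
 * Returns 0 on success or a negative errno value. */
static int toy_parallel_sum(const int *items, size_t count, size_t n_threads,
			    long *out)
{
	struct toy_thread_info tinfo[TOY_MAX_THREADS];
	size_t started, per;
	int err = 0;

	if (n_threads == 0 || n_threads > TOY_MAX_THREADS)
		return -EINVAL;

	memset(tinfo, 0, sizeof(tinfo));
	per = count / n_threads;
	for (started = 0; started < n_threads; started++) {
		struct toy_thread_info *ti = &tinfo[started];

		ti->items = items + started * per;
		/* The last thread also takes the remainder. */
		ti->count = (started == n_threads - 1) ? count - started * per : per;
		/* pthread_create returns a positive errno value on failure. */
		err = pthread_create(&ti->tid, NULL, toy_resolve, ti);
		if (err != 0) {
			err = -err;
			break;
		}
	}

	/* Join only the threads that were actually started. */
	*out = 0;
	for (size_t i = 0; i < started; i++) {
		int join_err = pthread_join(tinfo[i].tid, NULL);

		if (join_err != 0 && err == 0)
			err = -join_err;
		*out += tinfo[i].partial; /* serial merge on the main thread */
	}
	return err;
}
```

Because each worker writes only to its own `partial`, no locking is needed until the merge, which happens after the joins.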
Comment thread tools/depmod.c

for (unsigned int i = 0; i < n_threads; i++) {
struct module_symbols *mod_syms = &tinfo[i].mod_syms;

Check warning

Code scanning / CodeQL

Local variable address stored in non-local memory Warning

A stack address which arrived via a parameter may be assigned to a non-local variable.
@evelikov
Collaborator Author

Seems like the CI is failing, even if it's working fine locally. Will check at some point tomorrow/day after.

In the meantime, do people have a preference about:

  • exposing the number of threads to users
  • adjusting the default to nproc/2
  • keeping/nuking/wrapping the (void *)(intptr_t) casts
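
For the third option, "wrapping" the casts could look something like the sketch below. The helper names `int_to_ptr` and `ptr_to_int`, as well as `worker` and `run_worker`, are hypothetical and only illustrate the idea; they are not taken from the patch.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical helpers: pack an int status into the void * that a
 * thread returns, and unpack it again after pthread_join. */
static inline void *int_to_ptr(int v)
{
	return (void *)(intptr_t)v;
}

static inline int ptr_to_int(void *p)
{
	return (int)(intptr_t)p;
}

/* Toy worker: hands back the status it was given as its argument. */
static void *worker(void *arg)
{
	return int_to_ptr(ptr_to_int(arg));
}

/* Runs one worker. Returns the worker's status, or the negated
 * pthread error if creating or joining the thread failed. */
static int run_worker(int status)
{
	pthread_t tid;
	void *res = NULL;
	int err;

	/* pthread_create/join return a positive errno value on failure. */
	err = pthread_create(&tid, NULL, worker, int_to_ptr(status));
	if (err != 0)
		return -err;
	err = pthread_join(tid, &res);
	if (err != 0)
		return -err;
	return ptr_to_int(res);
}
```

This keeps the double cast in exactly two places, at the cost of two extra one-line helpers.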

@evelikov
Collaborator Author

Kicking off helgrind reminded me that we mutate kmod_ctx for each new kmod_module instance, since the former has a pool (hash map) with all the modules.

We can resolve at least some of the problem by deferring the hash-entry removal(s), aka kmod_module_unref, to the main thread. That means we don't free until the end of the program, easily reaching 0.5GB of memory, which is something we cannot do in the general case.

So as a whole, we could look into making the hash implementation lock-free... and, as a small aside, converting the refcounting to use atomics.
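
As a rough idea of what atomic refcounting could look like, here is a minimal C11 sketch. `toy_obj` and its functions are hypothetical and do not reflect libkmod's actual kmod_module layout; the sketch also covers only the counter and does nothing about the shared hash pool.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; not libkmod's real structure. */
struct toy_obj {
	atomic_int refcount;
};

static void toy_obj_init(struct toy_obj *o)
{
	atomic_init(&o->refcount, 1);
}

static void toy_obj_ref(struct toy_obj *o)
{
	/* Taking a reference needs atomicity but no extra ordering. */
	atomic_fetch_add_explicit(&o->refcount, 1, memory_order_relaxed);
}

/* Returns true when the caller dropped the last reference and is
 * therefore responsible for releasing the object. */
static bool toy_obj_unref(struct toy_obj *o)
{
	/* acq_rel makes writes from other owners visible before the free. */
	return atomic_fetch_sub_explicit(&o->refcount, 1,
					 memory_order_acq_rel) == 1;
}
```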

To make things even more fun, helgrind is also unhappy with OpenSSL - haven't looked into why/how, to be honest. It can be side-stepped by a) having a more granular kmod_module_get_info variant which doesn't parse the crypto bits and/or b) adding an alternative code-path to OpenSSL.

Since I don't know if I'll be looking at it, I'm going to park this as draft.

@evelikov evelikov marked this pull request as draft February 14, 2026 13:57
@evelikov
Collaborator Author

If OpenSSL/helgrind proves to be a blocking factor, it can potentially be side-stepped by #426.

Comment thread tools/depmod.c
struct thread_info {
	pthread_t tid;
	struct module_symbols mod_syms;
} *tinfo;
Contributor


I was thinking about making mod_syms a thread_local, but... we will already need to allocate the array for keeping the tid

Comment thread tools/depmod.c
if (local_err != 0)
	err = -local_err;
if ((int)(intptr_t)res != 0)
	err = (int)(intptr_t)res;
Contributor


Should we worry about the other threads continuing even if one of them has already failed? I guess we can keep it simple, otherwise I think we'd need to add a cond var just for this.
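
If early exit ever becomes worthwhile, one lighter alternative to the cond var mentioned above would be a shared atomic flag that workers poll between modules. To be clear, this is a different technique from the one suggested in the comment, and the names below (`toy_work`, `toy_process`) are hypothetical.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-thread work description. */
struct toy_work {
	const int *items;  /* a negative item simulates a failing module */
	size_t count;
	atomic_bool *stop; /* shared by all workers */
	size_t processed;
};

/* Processes items until done, until one fails, or until another
 * worker has signalled failure through the shared flag. */
static int toy_process(struct toy_work *w)
{
	for (size_t i = 0; i < w->count; i++) {
		/* Bail out if some other worker has already failed. */
		if (atomic_load(w->stop))
			return -ECANCELED;
		if (w->items[i] < 0) {
			atomic_store(w->stop, true);
			return w->items[i];
		}
		w->processed++;
	}
	return 0;
}
```

A worker that is already in the middle of an item still finishes it, so this bounds the wasted work to one item per thread rather than stopping immediately.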

Comment thread tools/depmod.c
static int depmod_symbol_add(struct depmod *depmod, struct symbol *sym)
{
	int err = hash_add(depmod->symbols, sym->name, sym);
	if (err < 0)
Contributor


should we keep the free(sym) here to avoid leaking it?
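
A minimal sketch of the ownership rule being asked about: if the add function takes ownership of the symbol, it frees it when insertion fails. The types and `toy_hash_add` below are simplified, hypothetical stand-ins, not the real hash_add() or struct symbol from tools/depmod.c.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, simplified types for illustration only. */
struct toy_symbol {
	char name[32];
};

struct toy_table {
	struct toy_symbol *slots[2];
	size_t used;
};

/* Stand-in for a hash insert: fails with -ENOMEM once the table is full. */
static int toy_hash_add(struct toy_table *t, struct toy_symbol *sym)
{
	if (t->used >= 2)
		return -ENOMEM;
	t->slots[t->used++] = sym;
	return 0;
}

/* Takes ownership of sym: on failure it is freed here, so the caller
 * must not touch it afterwards. */
static int toy_symbol_add(struct toy_table *t, struct toy_symbol *sym)
{
	int err = toy_hash_add(t, sym);

	if (err < 0)
		free(sym);
	return err;
}
```

Whether the free belongs in this function or in the caller depends on who allocates the symbol after the split; the sketch assumes the callee owns it.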

Comment thread tools/depmod.c

modules_per_thread = (depmod->modules.count + n_threads - 1) / n_threads;
last_modules_per_thread =
	modules_per_thread - (depmod->modules.count % n_threads);
Contributor


11 modules, 4 threads:

modules_per_thread == (11 + 3) / 4 == 3
last_modules_per_thread = 3 -  (11 % 4) = 0

I think you wanted:

last_modules_per_thread = depmod->modules.count - (n_threads - 1) * modules_per_thread;

so the last thread has potentially fewer modules to calculate:

tinfo[0].mod_syms.modules_count = 3
tinfo[1].mod_syms.modules_count = 3
tinfo[2].mod_syms.modules_count = 3
tinfo[3].mod_syms.modules_count = 2
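
The split above can be checked with a small helper. Note that this sketch uses a slightly different scheme than the PR (floor division, with the remainder spread over the first threads), since a ceiling-based chunk size can overshoot when the count is small relative to the thread count, e.g. 5 modules on 4 threads gives chunks of 2 and 3 * 2 > 5. The name `chunk_for_thread` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: computes the slice [*start, *start + *len) of
 * `count` items assigned to thread `i` out of `n_threads`.
 * The first (count % n_threads) threads get one extra item. */
static void chunk_for_thread(size_t count, size_t n_threads, size_t i,
			     size_t *start, size_t *len)
{
	size_t base = count / n_threads;
	size_t rem = count % n_threads;

	*len = base + (i < rem ? 1 : 0);
	*start = i * base + (i < rem ? i : rem);
}
```

For 11 modules and 4 threads this yields the same 3, 3, 3, 2 distribution as listed above, and the lengths always add up to the module count.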

Comment thread tools/depmod.c
mod_syms->depmod = depmod;
mod_syms->modules =
	(struct mod **)depmod->modules.array + (i * modules_per_thread);
mod_syms->modules_count = modules_per_thread;
Contributor


Hmm... assuming the threads are working on different modules, and thus have access to different kmod_mod, I think it should be fine to call into kmod from a concurrency perspective. One place I'm not so sure about is when freeing the module - because that puts it back in the pool and there may be a synchronization issue with another thread creating the module.

The other issue would be logging, but that I think we could ignore.

Maybe it would be better from a correctness perspective to clone the kmod_ctx per thread?

@lucasdemarchi
Contributor

> Kicking off helgrind reminded me that we mutate kmod_ctx for each new kmod_module instance. In that the former has a pool (hash map) with all the modules.
>
> We can resolve, at least some of, the problem by deferring the hash-entry removal(s) aka kmod_module_unref to the main thread, which means we don't free until the end of the program - easily reaching 0.5GB of memory. Something we cannot do in the general case.
>
> So as a whole, we could look on making hash implementation lock-free... And as a small aside, to convert the refcounting to use atomics.

Sorry, just saw this comment after reviewing... you had already caught that issue. But this solution still seems fragile - we are breaking the rule of not calling into libkmod from multiple threads. I think just cloning the ctx would be better. This is what e.g. udev does to handle multithreading.
