Threaded depmod, take 2 by evelikov · Pull Request #425 · kmod-project/kmod

evelikov · 2026-02-13T19:40:33Z

This is a retake of #412, the slimmed down version in particular.

The changes:

split out report error on array_append failure into separate patch
reuse struct symbol, instead of rolling our own
split/reuse symbol creation - 2 times fewer allocations, honour sym_prefix
don't allocate the struct symbol(s) twice
consolidate cleanup into a helper function
remove unused members in load_modules_work, rename
swap modules, start and end for modules(_start) and count
store thread id and "work" struct in a single place - nproc fewer calloc calls
properly check for errors on pthread_create/join
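
The last two items can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `struct work`, `worker` and `run_workers` are hypothetical names, and the "work" here is a trivial doubling of an integer. It assumes the convention that a worker returns 0 or a negative errno value through its `void *` result.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-thread payload, standing in for the real work struct. */
struct work {
	int input;
	int output;
};

/* Thread id and work item live side by side, so one calloc covers both. */
struct thread_info {
	pthread_t tid;
	struct work work;
};

static void *worker(void *arg)
{
	struct work *w = arg;

	w->output = w->input * 2;
	/* Assumed convention: 0 on success, negative errno value on failure. */
	return (void *)(intptr_t)0;
}

/* Returns 0 and stores the sum of all outputs in *sum, or a negative errno. */
static int run_workers(unsigned int n_threads, int *sum)
{
	struct thread_info *tinfo;
	unsigned int started = 0;
	int err = 0;

	assert(sum != NULL);
	if (n_threads == 0)
		return -EINVAL;

	tinfo = calloc(n_threads, sizeof(*tinfo));
	if (tinfo == NULL)
		return -ENOMEM;

	for (unsigned int i = 0; i < n_threads; i++) {
		int ret;

		tinfo[i].work.input = (int)i + 1;
		/* pthread_create returns a positive error number, it does not set errno. */
		ret = pthread_create(&tinfo[i].tid, NULL, worker, &tinfo[i].work);
		if (ret != 0) {
			err = -ret;
			break;
		}
		started++;
	}

	/* Join only the threads that were actually created. */
	*sum = 0;
	for (unsigned int i = 0; i < started; i++) {
		void *res;
		int ret = pthread_join(tinfo[i].tid, &res);

		if (ret != 0)
			err = -ret;
		else if ((int)(intptr_t)res != 0)
			err = (int)(intptr_t)res;
		else
			*sum += tinfo[i].work.output;
	}

	free(tinfo);
	return err;
}
```

Calling `run_workers(4, &sum)` starts four threads with inputs 1 to 4 and adds up their doubled outputs. Tracking `started` separately keeps the join loop from touching thread ids that were never initialised.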

In summary, the numbers are as follows:

| master | MR, nproc 1 | MR (nproc 8), CPUs used | compression |
|---|---|---|---|
| 0.44 s | 0.44 s | 0.73 s, 1.5 | none |
| 4.45 s | 4.45 s | 1.40 s, 4.3 | xz |
| 1.05 s | 1.04 s | 0.62 s, 2.9 | zst |

These are mostly identical to the previous numbers, with a notable exception for uncompressed modules: there, unsurprisingly, adding extra threads hurts performance.

Some nproc numbers from an aging 4 core + HT laptop.

| nproc | none (s) | zst (s) |
|---|---|---|
| 1 | 0.44 | 1.04 |
| 2 | 0.39 | 0.73 |
| 4 | 0.54 | 0.63 |
| 8 | 0.73 | 0.62 |

Some `perf stat -r 100` numbers

Even with 100 repeats, the numbers are not that stable/reliable.

| compression | stat | master | MR nproc 1 | slim | MR |
|---|---|---|---|---|---|
| uncompressed | task-clock | 435,749,999 | 435,047,972 | 1,082,318,930 | 1,087,043,546 |
| uncompressed | instructions | 1,213,446,151 | 1,202,426,481 | 1,193,567,961 | 1,192,806,832 |
| uncompressed | time elapsed | 0.43846 +- 0.00177 | 0.43802 +- 0.00184 | 0.72881 +- 0.00417 | 0.73283 +- 0.00318 |
| xz | task-clock | 4,450,100,429 | 4,444,117,416 | 6,063,143,589 | 6,057,782,176 |
| xz | instructions | 27,467,626,437 | 27,439,935,695 | 27,414,581,685 | 27,409,414,470 |
| xz | time elapsed | 4.45702 +- 0.00579 | 4.45167 +- 0.00663 | 1.40124 +- 0.00259 | 1.40324 +- 0.00255 |
| zst | task-clock | 1,047,500,349 | 1,038,654,591 | 1,836,694,522 | 1,810,041,398 |
| zst | instructions | 6,941,232,755 | 6,976,108,698 | 6,904,710,582 | 6,904,091,683 |
| zst | time elapsed | 1.05134 +- 0.00459 | 1.04285 +- 0.00338 | 0.63994 +- 0.00215 | 0.62171 +- 0.00400 |

Valgrind memory stats

| allocations | bytes | compression | build |
|---|---|---|---|
| 1,295,300 | 175,206,192 | uncompressed | master |
| 1,295,407 | 180,905,088 | uncompressed | MR nproc 1 |
| 1,322,139 | 176,862,003 | uncompressed | slim |
| 1,295,462 | 176,150,040 | uncompressed | MR |
| 1,404,387 | 53,370,466,853 | xz | master |
| 1,404,494 | 53,376,165,749 | xz | MR nproc 1 |
| 1,431,226 | 53,372,122,664 | xz | slim |
| 1,404,549 | 53,371,410,701 | xz | MR |
| 1,306,364 | 1,169,580,231 | zst | master |
| 1,306,471 | 1,175,279,127 | zst | MR nproc 1 |
| 1,333,203 | 1,171,236,042 | zst | slim |
| 1,306,526 | 1,170,524,079 | zst | MR |

Summary, relative to master

MR, nproc 1 - extra ~100 allocs, ~5.7M bytes in total
Slim - extra ~27k allocs, ~1.7M bytes in total
MR - extra ~160 allocs, ~1M bytes in total

The 27k -> 160 change is expected since we reuse struct symbols, although why we get an extra ~5MB of memory with nproc 1 is beyond me 🤷

Currently, in the (unlikely) case of array_append failing, we'll return success and thus partial output.

Fixes: 823849a ("tools/depmod: use separate arrays for alias,xxxdep values")
Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>

With a later commit, we'll need to create symbols outside of the depmod_symbol_add function, so split the function in two.

Aside: depmod_symbol_add should really get some error checking, that we're not helping with here... Also we should consider using hash_add_unique() since currently a duplicate symbol will override the previous one. But all of that can happen at a later date.

Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>

Split out the module symbol resolution - effectively, the initial decompression of the module and elf parsing - from adding the data into the depmod hashmap.

This will come in handy with the next commit, which will delegate the symbol resolution (and thus the slowest part, decompression) to threads.

Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>

Add pthread-based parallelization to depmod to improve performance on multi-core systems.

The changes parallelize:
- Module symbol loading

Co-authored-by: jared mauch <jared@puck.nether.net>
Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>

+
+	for (unsigned int i = 0; i < n_threads; i++) {
+		struct module_symbols *mod_syms = &tinfo[i].mod_syms;
+


evelikov · 2026-02-13T19:51:32Z

Seems like the CI is failing, even though it works fine locally. Will check at some point tomorrow/day after.

In the meantime, do people have a preference about:

exposing the number of threads to users
adjusting the default to nproc/2
keeping/nuking/wrapping the (void *)(intptr_t) casts
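
For the last two options, here is a minimal sketch of what wrapping the casts and defaulting to half the online CPUs could look like. `int_to_ptr`, `ptr_to_int` and `default_thread_count` are hypothetical names, not existing kmod helpers, and `_SC_NPROCESSORS_ONLN` is a widely available extension rather than strict POSIX.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* The round trip through intptr_t only works if an int fits into it. */
static_assert(sizeof(intptr_t) >= sizeof(int), "int must fit in intptr_t");

/* Hypothetical helper: smuggle an int error code through a void *. */
static inline void *int_to_ptr(int value)
{
	return (void *)(intptr_t)value;
}

/* Hypothetical helper: recover the int stored by int_to_ptr(). */
static inline int ptr_to_int(void *ptr)
{
	return (int)(intptr_t)ptr;
}

/* Half of the online CPUs, but never fewer than one thread. */
static unsigned int default_thread_count(void)
{
	long nproc = sysconf(_SC_NPROCESSORS_ONLN);

	/* Covers both a sysconf() failure (-1) and single-CPU systems. */
	if (nproc < 2)
		return 1;
	return (unsigned int)(nproc / 2);
}
```

With helpers like these, a thread function could `return int_to_ptr(err);` and the joining side could read `ptr_to_int(res)`, keeping the double cast in one place.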

evelikov · 2026-02-14T13:57:47Z

Kicking off helgrind reminded me that we mutate kmod_ctx for each new kmod_module instance, since the former has a pool (hash map) with all the modules.

We can resolve at least some of the problem by deferring the hash-entry removal(s), aka kmod_module_unref, to the main thread. That means we don't free until the end of the program, easily reaching 0.5GB of memory, which is something we cannot do in the general case.

So overall, we could look into making the hash implementation lock-free... And, as a small aside, converting the refcounting to use atomics.
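
As an illustration of the refcounting aside only, a C11 `stdatomic.h` based counter might look like the sketch below. `struct object` and its helpers are hypothetical and are not libkmod's kmod_module; the sketch also does nothing about the pool (hash map) itself, which would still need its own synchronisation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical refcounted object, not libkmod's struct kmod_module. */
struct object {
	atomic_int refcount;
};

static struct object *object_new(void)
{
	struct object *obj = calloc(1, sizeof(*obj));

	if (obj != NULL)
		atomic_init(&obj->refcount, 1);
	return obj;
}

static struct object *object_ref(struct object *obj)
{
	assert(obj != NULL);
	/* Taking a reference needs atomicity, but no ordering. */
	atomic_fetch_add_explicit(&obj->refcount, 1, memory_order_relaxed);
	return obj;
}

/* Returns true when the last reference was dropped and obj was freed. */
static bool object_unref(struct object *obj)
{
	assert(obj != NULL);
	/* acq_rel makes earlier writes by other owners visible before the free. */
	if (atomic_fetch_sub_explicit(&obj->refcount, 1, memory_order_acq_rel) != 1)
		return false;
	free(obj);
	return true;
}
```

The decrement returns the previous value, so exactly one caller observes 1 and frees the object, even when several threads drop their references at the same time.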

To make things even more fun, helgrind is also unhappy with OpenSSL - I haven't looked into why/how, to be honest. It can be side-stepped by a) having a more granular kmod_module_get_info variant which doesn't parse the crypto bits and/or b) adding an alternative code-path to OpenSSL.

Since I don't know if I'll be looking at it, I'm going to park this as draft.

evelikov · 2026-02-15T21:17:37Z

If OpenSSL/helgrind proves to be a blocking factor, it can potentially be side-stepped by #426.

lucasdemarchi · 2026-05-07T06:33:01Z

+	struct thread_info {
+		pthread_t tid;
+		struct module_symbols mod_syms;
+	} *tinfo;


I was thinking about making mod_syms a thread_local, but... we will already need to allocate the array for keeping the tid

lucasdemarchi · 2026-05-07T06:49:26Z

+			if (local_err != 0)
+				err = -local_err;
+			if ((int)(intptr_t)res != 0)
+				err = (int)(intptr_t)res;


Should we worry about the other threads continuing even if one of them has already failed? I guess we can keep it simple, otherwise I think we'd need to add a cond var just for this.
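
One alternative to a condition variable, sketched here purely as an illustration and not as something this PR does, is a shared atomic flag that each worker polls between items. `stop_requested`, `process_item` and `process_slice` are hypothetical names, and the per-item step is a stand-in that fails on negative input.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Shared by all workers; set by the first one that fails. */
static atomic_bool stop_requested;

/* Hypothetical per-item step: negative items are treated as failures. */
static int process_item(int item)
{
	return item < 0 ? -1 : 0;
}

/*
 * Processes up to count items and stores how many succeeded in *done.
 * Returns the error of the failing item, or 0 if this worker finished
 * or stopped early because another worker had already failed.
 */
static int process_slice(const int *items, size_t count, size_t *done)
{
	assert(items != NULL && done != NULL);

	*done = 0;
	for (size_t i = 0; i < count; i++) {
		int err;

		/* Someone else failed: stop early without reporting an error. */
		if (atomic_load(&stop_requested))
			return 0;

		err = process_item(items[i]);
		if (err < 0) {
			atomic_store(&stop_requested, true);
			return err;
		}
		(*done)++;
	}
	return 0;
}
```

The flag is only checked between items, so a worker still finishes the item it is currently on. That is cheaper than a condition variable, at the cost of a slightly delayed stop.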

lucasdemarchi · 2026-05-07T06:52:40Z

+static int depmod_symbol_add(struct depmod *depmod, struct symbol *sym)
+{
+	int err = hash_add(depmod->symbols, sym->name, sym);
+	if (err < 0)


should we keep the free(sym) here to avoid leaking it?

lucasdemarchi · 2026-05-07T07:11:31Z

+
+	modules_per_thread = (depmod->modules.count + n_threads - 1) / n_threads;
+	last_modules_per_thread =
+		modules_per_thread - (depmod->modules.count % n_threads);


11 modules, 4 threads:

modules_per_thread == (11 + 3) / 4 == 3
last_modules_per_thread = 3 - (11 % 4) = 0

I think you wanted:

last_modules_per_thread = depmod->modules.count - (n_threads - 1) * modules_per_thread;

so the last thread has potentially fewer modules to calculate:

tinfo[0].mod_syms.modules_count = 3
tinfo[1].mod_syms.modules_count = 3
tinfo[2].mod_syms.modules_count = 3
tinfo[3].mod_syms.modules_count = 2
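
Ceiling division can leave the trailing threads with fewer modules, or with none at all (5 modules over 4 threads gives 2, 2, 1, 0). A sketch that derives each thread's slice from what is left, instead of a separate last-thread formula, might look like this. `struct slice` and `slice_for_thread` are hypothetical names used for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [start, start + count) of module indices. */
struct slice {
	size_t start;
	size_t count;
};

/* Splits count items over n_threads and returns the slice of thread i. */
static struct slice slice_for_thread(size_t count, size_t n_threads, size_t i)
{
	struct slice s = { 0, 0 };
	size_t per_thread;

	assert(n_threads > 0 && i < n_threads);

	/* Ceiling division, as in the hunk above. */
	per_thread = (count + n_threads - 1) / n_threads;

	s.start = i * per_thread;
	if (s.start >= count) {
		/* More threads than work: this thread gets nothing. */
		s.start = count;
		return s;
	}

	/* Take a full chunk, or whatever is left for the final slice. */
	s.count = count - s.start;
	if (s.count > per_thread)
		s.count = per_thread;
	return s;
}
```

For 11 modules and 4 threads this yields counts of 3, 3, 3 and 2, matching the table above, and the clamping keeps the count from going negative when there are more threads than the chunk size can fill.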

lucasdemarchi · 2026-05-07T07:19:56Z

+		mod_syms->depmod = depmod;
+		mod_syms->modules =
+			(struct mod **)depmod->modules.array + (i * modules_per_thread);
+		mod_syms->modules_count = modules_per_thread;


Hmm... assuming the threads are working on different modules, and thus have access to different kmod_mod, I think it should be fine to call into kmod from a concurrency perspective. One place I'm not so sure about is freeing the module, because that puts it back in the pool and there may be a synchronization issue with another thread creating the module.

The other issue would be logging, but that I think we could ignore.

Maybe it would be better from a correctness perspective to clone the kmod_ctx per thread?

lucasdemarchi · 2026-05-07T07:23:43Z

> Kicking off helgrind reminded me that we mutate kmod_ctx for each new kmod_module instance, since the former has a pool (hash map) with all the modules.
>
> We can resolve at least some of the problem by deferring the hash-entry removal(s), aka kmod_module_unref, to the main thread. That means we don't free until the end of the program, easily reaching 0.5GB of memory, which is something we cannot do in the general case.
>
> So overall, we could look into making the hash implementation lock-free... And, as a small aside, converting the refcounting to use atomics.

Sorry, just saw this comment after reviewing... you had already caught that issue. But this solution still seems fragile: we are breaking the rule of not calling into libkmod from multiple threads. I think just cloning the ctx would be better. This is what e.g. udev does to handle multithreading.

evelikov and others added 4 commits February 13, 2026 18:26

tools/depmod: Parallelize module loading

f65443f

evelikov force-pushed the threaded-depmod branch from c33b7b1 to f65443f February 13, 2026 19:43

github-advanced-security AI found potential problems Feb 13, 2026

evelikov marked this pull request as draft February 14, 2026 13:57

lucasdemarchi requested changes May 7, 2026

evelikov wants to merge 4 commits into kmod-project:master from evelikov:threaded-depmod
