depmod: Parallelize dependency computation and output generation by jaredmauch · Pull Request #412 · kmod-project/kmod

jaredmauch · 2026-01-19T15:33:33Z

Parallelize depmod to leverage multiple CPU cores. The changes include:

Parallel module loading: Load symbols and module info concurrently across multiple threads
Parallel dependency resolution: Resolve module dependencies concurrently
Parallel dependency caching: Pre-compute transitive dependency caches in parallel before output generation
Parallel output generation: Generate modules.dep and modules.dep.bin files concurrently

Thread safety is ensured through:

Mutex protection for shared data structures (symbols, dependencies, output)
Atomic operations for reference counting
Memory barriers to ensure visibility of atomic operations
Fine-grained locking to minimize contention

In my testing, depmod's wall-clock time drops by roughly 5-6x. Feedback most welcome!

stoeckmann · 2026-01-21T16:11:59Z

Running it multiple times with ASAN enabled reveals race conditions. I get double frees and heap overflows.

Ignoring these, the performance gains are not as specified, which I would have to investigate further. The biggest problem on my test machine should be the number of CPU cores. The threads will block each other rather often, leading to an unnecessarily high number of context switches.

Can you split this into multiple commits or just use threads for the parsing aspect which can properly scale?

Also, how did you get these high numbers? Have you run multiple times to make sure that you do not measure disk caches in the kernel?
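One way to check for cache effects is to time several back-to-back runs and look at the first run separately from the later ones. The following is a minimal sketch, not part of the PR: `time_runs` and `summarize` are hypothetical helper names, and the command in the usage section is a placeholder to be swapped for the depmod invocation under test.

```python
import statistics
import subprocess
import sys
import time


def time_runs(cmd, runs=5):
    """Run cmd `runs` times and return the wall-clock seconds of each run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Separate the first (possibly cold-cache) run from the warm ones."""
    warm = samples[1:] or samples
    return {
        "first": samples[0],
        "warm_min": min(warm),
        "warm_median": statistics.median(warm),
    }


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere; substitute the
    # depmod invocation under test (e.g. ["./depmod", "-a"], run as root).
    cmd = [sys.executable, "-c", "pass"]
    print(summarize(time_runs(cmd, runs=3)))
```

If the first sample is far above the warm median, the page cache is likely contributing to the difference between runs.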

jaredmauch · 2026-01-21T19:44:30Z

Running back-to-back I see the results below, and I can update the PR to split things out. This system has 48 CPUs, so while there may be a lot of context-switch blocking, the runtime reduction is quite significant for me, e.g.:

jared@ru1415:~/kmod/builddir$ echo 'Debian trixie time:' ; echo `nproc` CPUs ; dpkg -l kmod ; /sbin/depmod -V ; sudo time /sbin/depmod -a ; git branch ; echo 'threaded branch time';  ./depmod -V ; sudo time ./depmod -a
Debian trixie time:
48 CPUs
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=======================================
ii  kmod           34.2-2       amd64        tools for managing Linux kernel modules
kmod version 34.2
+ZSTD +XZ -ZLIB +OPENSSL
35.43user 0.70system 0:36.15elapsed 99%CPU (0avgtext+0avgdata 195368maxresident)k
0inputs+6784outputs (0major+87389minor)pagefaults 0swaps
* depmod_threaded
  master
threaded branch time
kmod version 34
+ZSTD +XZ +ZLIB +OPENSSL
53.32user 5.28system 0:06.03elapsed 970%CPU (0avgtext+0avgdata 736128maxresident)k
0inputs+6784outputs (0major+214059minor)pagefaults 0swaps
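The two `time` summary lines above can be reduced to a wall-clock speedup and a CPU-time cost ratio. This is a small sketch that assumes the GNU time default output format shown in the paste; `parse_time_line` is a hypothetical helper, not anything from kmod.

```python
import re

# Matches e.g. "35.43user 0.70system 0:36.15elapsed 99%CPU".
TIME_RE = re.compile(
    r"(?P<user>[\d.]+)user\s+(?P<sys>[\d.]+)system\s+"
    r"(?:(?P<h>\d+):)?(?P<m>\d+):(?P<s>[\d.]+)elapsed"
)


def parse_time_line(line):
    """Return (cpu_seconds, wall_seconds) from a GNU time summary line."""
    m = TIME_RE.search(line)
    if m is None:
        raise ValueError(f"not a GNU time summary line: {line!r}")
    wall = int(m["h"] or 0) * 3600 + int(m["m"]) * 60 + float(m["s"])
    return float(m["user"]) + float(m["sys"]), wall


if __name__ == "__main__":
    base_cpu, base_wall = parse_time_line(
        "35.43user 0.70system 0:36.15elapsed 99%CPU")
    thr_cpu, thr_wall = parse_time_line(
        "53.32user 5.28system 0:06.03elapsed 970%CPU")
    print(f"wall-clock speedup: {base_wall / thr_wall:.1f}x")
    print(f"CPU time ratio:     {thr_cpu / base_cpu:.1f}x")
```

On these numbers the threaded build finishes roughly 6x sooner while consuming about 1.6x the total CPU time (user plus system).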

The disk cache is not a factor here, both should be running from cache. My modules are xz compressed in this case, but I can test without that, eg:

jared@ru1415:~/kmod/builddir$ ls /lib/modules/`uname -r`/kernel/*/*.ko* | head -5
/lib/modules/6.19.0-rc6+/kernel/block/bfq.ko.xz
/lib/modules/6.19.0-rc6+/kernel/block/kyber-iosched.ko.xz
/lib/modules/6.19.0-rc6+/kernel/crypto/adiantum.ko.xz
/lib/modules/6.19.0-rc6+/kernel/crypto/aegis128.ko.xz
/lib/modules/6.19.0-rc6+/kernel/crypto/aes_ti.ko.xz

I'll check through for double free/overflow issues w/ ASAN.. I thought I had gotten them all but I guess not :-(

lucasdemarchi · 2026-01-21T20:01:04Z

Thanks, I will take a look soon.

> The disk cache is not a factor here, both should be running from cache. My modules are xz compressed in this case, but I can test without that, eg:

I think being xz compressed is a big factor here as the decompression is slow and you are profiting from decompressing the modules in parallel. I wouldn't expect that big number for non-compressed or zstd-compressed modules.

stoeckmann · 2026-01-21T20:29:55Z

> I think being xz compressed is a big factor here as the decompression is slow and you are profiting from decompressing the modules in parallel. I wouldn't expect that big number for non-compressed or zstd-compressed modules.

With zstd, it's going down from 1.4 s to 1.0 s, with xz from 11 s to 6 s (14 cores on Arch Linux with 6407 modules).

Just tested scaling on my system. I'm not going further than 6 cores, not out of laziness but because the improvement stops, which I think is down to the hybrid E/P core design of my CPU:

1: 11.0 s
2: 9.0 s
3: 8.4 s
4: 7.6 s
5: 6.6 s
6: 6.1 s
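The scaling list above can be turned into speedup, parallel efficiency, and an estimated serial fraction using the Karp-Flatt metric. This is a rough sketch: the metric lumps together truly serial work, lock contention, and hardware effects such as the hybrid cores mentioned, so it is an indicator rather than a precise measurement.

```python
# Wall-clock seconds per core count, copied from the list above.
TIMES = {1: 11.0, 2: 9.0, 3: 8.4, 4: 7.6, 5: 6.6, 6: 6.1}


def speedup(times, p):
    """Speedup on p cores relative to the single-core run."""
    return times[1] / times[p]


def efficiency(times, p):
    """Fraction of ideal linear scaling achieved on p cores."""
    return speedup(times, p) / p


def karp_flatt(times, p):
    """Experimentally determined serial fraction for p > 1 cores."""
    s = speedup(times, p)
    return (1 / s - 1 / p) / (1 - 1 / p)


if __name__ == "__main__":
    for p in sorted(TIMES):
        if p == 1:
            continue
        print(f"{p} cores: speedup {speedup(TIMES, p):.2f}x, "
              f"efficiency {efficiency(TIMES, p):.0%}, "
              f"serial fraction {karp_flatt(TIMES, p):.2f}")
```

At 6 cores the speedup is about 1.8x, which corresponds to an estimated serial fraction a little under one half.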

stoeckmann · 2026-01-21T20:34:06Z

> jared@ru1415:~/kmod/builddir$ echo 'Debian trixie time:' ; echo `nproc` CPUs ; dpkg -l kmod ; /sbin/depmod -V ; sudo time /sbin/depmod -a ; git branch ; echo 'threaded branch time';  ./depmod -V ; sudo time ./depmod -a

The host kmod version and patched kmod master version are very close to each other, but it might also depend on compiler flags in place, e.g. debug vs release build with meson.

Might be interesting to re-run the tests. I have repacked my modules from zstd to xz and mentioned the vast difference between zstd and xz above.

Thus, I think most of the advantage is the parallel file parsing, where depmod spends the majority of its time. Having only file parsing in parallel will keep the diff simpler. Looking forward to hearing about results when only that aspect is parallel. It would also be nice to know how the scaling is on your system. Is there a linear speed improvement with more cores?

jaredmauch · 2026-01-21T22:39:04Z

I think I got the items. I'll squash/force-push in a few; waiting for good ol' valgrind to do its thing here...

jared@ru1415:~/kmod/builddir$ sudo valgrind --leak-check=full --track-origins=yes --progress-interval=30 ./depmod -v
==342455== Memcheck, a memory error detector
==342455== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==342455== Using Valgrind-3.24.0 and LibVEX; rerun with -h for copyright info
==342455== Command: ./depmod -v
==342455== 
--342455-- PROGRESS: U 30s, W 31s, 96.8% CPU, EvC 52.93M, TIn 7.2k, TOut 0.0k, #thr 49
--342455-- PROGRESS: U 60s, W 63s, 95.2% CPU, EvC 101.51M, TIn 7.2k, TOut 0.0k, #thr 49
--342455-- PROGRESS: U 90s, W 99s, 90.9% CPU, EvC 129.51M, TIn 7.2k, TOut 0.0k, #thr 49
--342455-- PROGRESS: U 120s, W 133s, 90.2% CPU, EvC 165.03M, TIn 7.3k, TOut 0.0k, #thr 48
--342455-- PROGRESS: U 152s, W 174s, 87.4% CPU, EvC 169.33M, TIn 7.3k, TOut 0.0k, #thr 48
--342455-- PROGRESS: U 180s, W 209s, 86.1% CPU, EvC 171.33M, TIn 7.3k, TOut 0.0k, #thr 48
--342455-- PROGRESS: U 210s, W 246s, 85.4% CPU, EvC 173.53M, TIn 7.3k, TOut 0.0k, #thr 48
--342455-- PROGRESS: U 240s, W 283s, 84.8% CPU, EvC 175.63M, TIn 7.3k, TOut 0.0k, #thr 48

It had clocked past an hour earlier when I killed it and restarted. I'll leave this running and check on it in the AM at worst (~14h from now?).

Thanks for the feedback today and the insight that this is [likely] all xz related.

stoeckmann · 2026-01-21T23:32:21Z

I took the current version of the PR (compiles fine now) and replaced all get_cpu_count() calls except the one in depmod_load_modules with fixed constant 1 to avoid threading.

The result is a tiny speed increase, most likely due to less lock handling in the system. Could you verify this on your system @jaredmauch? I'm especially interested in your high user and system times with the threading version. They should be lower then.

Most of these activities are very bad for parallel processing as long as everything is eventually put into the same data structure like the output trees.

Also, I'm still seeing memory leaks during these runs at times. Not sure if you have local changes for these as well already.

jaredmauch · 2026-01-22T00:22:43Z

Yes I do. valgrind is still running as well and I'll review the results in the AM. gonna wander away from the computer in a moment for my own sanity, but I'll force push now so you can poke at them if you're interested.

I'll also maybe just nuke module compression in my test setups as well, this turned into a bit of a side quest of side quests..

(mumbles re: valgrind)

--342455-- PROGRESS: U 5,588s, W 6,684s, 83.6% CPU, EvC 1676.66M, TIn 7.3k, TOut 0.0k, #thr 25
--342455-- PROGRESS: U 5,611s, W 6,711s, 83.6% CPU, EvC 1677.06M, TIn 7.3k, TOut 0.0k, #thr 25

Add pthread-based parallelization to depmod to improve performance on multi-core systems. The changes parallelize:

- Module symbol loading
- Dependency computation
- Dependency sorting
- Output generation

To ensure deterministic output order for modules.dep files, output lines are collected from worker threads and sorted alphabetically by module path before writing.

Add pthread dependency to meson.build and update test expectations to match the new sorted output format.
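The collect-then-sort step described in this commit message can be sketched as follows. This is an illustrative model, not the PR's C code: `dep_line` and `generate` are hypothetical names, and the module paths in the usage example are made up.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def dep_line(path, deps):
    """Format one modules.dep-style line: 'path: dep1 dep2'."""
    return f"{path}:" + "".join(f" {d}" for d in deps)


def generate(modules, workers=4):
    """modules maps a module path to its list of dependency paths."""
    lines = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(dep_line, p, d) for p, d in modules.items()]
        # Completion order depends on scheduling, so it is not stable.
        for fut in as_completed(futures):
            lines.append(fut.result())
    # Sorting restores a stable order. Each line starts with its module
    # path followed by ':', so the text before ':' is the sort key.
    lines.sort(key=lambda line: line.split(":", 1)[0])
    return lines


if __name__ == "__main__":
    example = {
        "kernel/b.ko": ["kernel/a.ko"],
        "kernel/a.ko": [],
    }
    print("\n".join(generate(example)))
```

The trade-off is that the output order no longer matches what a single-threaded run would have produced, which is why the commit also updates the test expectations.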

stoeckmann · 2026-01-24T10:44:44Z

I took your latest version of the PR and split the module symbol loading part into an own branch, which you can find here: https://github.com/stoeckmann/kmod/tree/depmod_pthread_slim

It reduces the diff without any further adjustments down to 200 lines compared to currently 889. Absolutely no surprise, obviously, since I removed changes.

The more important aspect is that the reduced version avoids any form of locking, because symbol loading threads can fully work in parallel. This makes it MUCH easier to verify correctness.

Also, the reduced version has slightly higher throughput due to the lack of locks, and fewer instructions in total.

I'd recommend that we focus on this aspect first, since these other 600 lines increase binary size, complexity and lead to slower execution times.
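The lock-free structure argued for here can be illustrated with a small model: each worker handles one input and returns its own result, and shared state is only assembled afterwards on a single thread. This sketch uses xz decompression as a stand-in for module loading; `load_one` and `load_all` are hypothetical names and do not correspond to functions in either branch.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor


def load_one(blob):
    """Stand-in for per-module parsing: decompress and return a summary.

    The function touches only its argument and local variables, so
    concurrent calls need no lock.
    """
    data = lzma.decompress(blob)
    return len(data)


def load_all(blobs, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, one per blob; each worker
        # produces its own value rather than writing to shared state.
        sizes = list(pool.map(load_one, blobs))
    # Single-threaded merge step: the only place shared state is built.
    return {index: size for index, size in enumerate(sizes)}


if __name__ == "__main__":
    blobs = [lzma.compress(bytes([i]) * 100_000) for i in range(8)]
    print(load_all(blobs))
```

Because no worker ever writes to a structure another worker reads, there is nothing to race on, which is what makes this shape easier to verify than fine-grained locking.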

Here are my perf stat depmod -o /tmp/output results (zstd compressed, Arch Linux, rather slow dual core). There's not much use in focusing on the real time, since it fluctuates a bit between runs. The really interesting aspects are the instruction counts and task counts.

master

$ perf stat depmod -o /tmp/output

 Performance counter stats for 'depmod -o /tmp/output':

     7,661,698,622      task-clock:u                     #    0.967 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           239,858      page-faults:u                    #   31.306 K/sec                     
    19,239,683,846      instructions:u                   #    1.31  insn per cycle            
    14,709,275,791      cycles:u                         #    1.920 GHz                       
     3,206,761,967      branches:u                       #  418.545 M/sec                     
        58,520,677      branch-misses:u                  #    1.82% of all branches           

       7.920360567 seconds time elapsed

       6.190921000 seconds user
       1.343797000 seconds sys

depmod_threaded

$ perf stat depmod -o /tmp/output

 Performance counter stats for 'depmod -o /tmp/output':

     8,452,885,314      task-clock:u                     #    1.441 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           239,866      page-faults:u                    #   28.377 K/sec                     
    19,428,974,701      instructions:u                   #    1.29  insn per cycle            
    15,005,888,514      cycles:u                         #    1.775 GHz                       
     3,256,194,805      branches:u                       #  385.217 M/sec                     
        59,641,380      branch-misses:u                  #    1.83% of all branches           

       5.864479618 seconds time elapsed

       6.524242000 seconds user
       1.754721000 seconds sys

depmod_pthread_slim

$ perf stat depmod -o /tmp/output

 Performance counter stats for 'depmod -o /tmp/output':

     7,967,206,929      task-clock:u                     #    1.410 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           239,800      page-faults:u                    #   30.098 K/sec                     
    19,303,882,074      instructions:u                   #    1.31  insn per cycle            
    14,711,762,542      cycles:u                         #    1.847 GHz                       
     3,221,880,958      branches:u                       #  404.393 M/sec                     
        59,201,477      branch-misses:u                  #    1.84% of all branches           

       5.651478036 seconds time elapsed

       6.396976000 seconds user
       1.422937000 seconds sys
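To make the three perf stat runs easier to compare, the instruction counts and elapsed times can be expressed relative to master. A short sketch using only the figures printed above; `relative_to` is a hypothetical helper.

```python
# (instructions:u, elapsed seconds) copied from the perf stat output above.
RUNS = {
    "master": (19_239_683_846, 7.920360567),
    "depmod_threaded": (19_428_974_701, 5.864479618),
    "depmod_pthread_slim": (19_303_882_074, 5.651478036),
}


def relative_to(baseline, runs):
    """Return {name: (instruction change in %, wall-clock speedup)}."""
    base_insn, base_wall = runs[baseline]
    return {
        name: (100.0 * (insn - base_insn) / base_insn, base_wall / wall)
        for name, (insn, wall) in runs.items()
    }


if __name__ == "__main__":
    for name, (insn_pct, speed) in relative_to("master", RUNS).items():
        print(f"{name}: {insn_pct:+.2f}% instructions, {speed:.2f}x wall clock")
```

On these single runs, depmod_threaded executes about 1% more instructions than master and depmod_pthread_slim about 0.3% more, with elapsed-time speedups of roughly 1.35x and 1.40x respectively.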

evelikov · 2026-01-28T16:07:46Z

Thanks for the idea @jaredmauch o/

As Lucas mentioned, decompression is universally the bottleneck, with signature/crypto handling second. Here are some rough/non-scientific numbers:

uncompressed - ~25% is signature/crypto, the rest evenly spread throughout
xz - ~90% is lzma_code, everything else is in the noise margin
zstd - ~55% is ZSTD_decompress, ~7% is signature/crypto, rest is in the noise margin

Quick test of the "slim" version shows promising results, while being much leaner, so I'd be in favour of it.

before wall/user/sys    after wall/user/sys    compression
0.37/0.29/0.08          0.37/0.48/0.36         none
4.06/3.96/0.09          1.40/5.65/0.27         xz
0.91/0.83/0.08          0.49/1.26/0.35         zstd

Aside: remember how I mentioned signature/crypto just above? It contributes to ~75% of the kmod_module_get_info runtime, yet the data is not used by depmod. So there's potential for another up to 25% improvement, if you are interested.
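The profile shares quoted above translate into upper bounds via Amdahl's law. The sketch below is a back-of-the-envelope model, assuming the rough percentages hold and that the affected work scales perfectly; `amdahl_speedup` is a hypothetical helper.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the runtime is sped up by `factor`.

    Use factor=float("inf") to model removing that work entirely.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be within [0, 1]")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


if __name__ == "__main__":
    # Uncompressed modules: ~25% of the runtime is signature/crypto handling.
    print(f"crypto removed: {amdahl_speedup(0.25, float('inf')):.2f}x")
    # xz modules: ~90% of the runtime is lzma_code, spread over n workers.
    for n in (2, 4, 8):
        print(f"xz on {n} workers: {amdahl_speedup(0.90, n):.2f}x")
```

Dropping a 25% share entirely shortens the runtime by a quarter, which is a 1.33x speedup. For xz, a 90% parallel share caps the achievable speedup at 10x no matter how many workers are used.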

evelikov · 2026-02-14T14:00:27Z

In the linked PR, I've polished the slim version a bit and added some more numbers from my end.

As indicated in there - we have somewhat big issues to deal with before either variant can land. HTH o/

evelikov · 2026-02-15T21:15:43Z

> uncompressed - ~25% is signature/crypto, the rest evenly spread throughout

Managed to get that to 0%, by using mbedtls instead of openssl - see #426. Not sure if that'll land, although if you're interested feel free to see if it makes any measurable difference on your end.

evelikov · 2026-04-20T14:26:58Z

@jaredmauch if you don't mind, I'm going to close this PR.

As mentioned, the top offenders are a) decompression and b) crypto/signatures, where each one is (at least partially) tackled by the linked PRs.

jaredmauch mentioned this pull request Jan 19, 2026

Depmod can take a long time to resolve depenencies #413

Open

stoeckmann suggested changes Jan 21, 2026

Comment thread tools/depmod.c Outdated

jaredmauch force-pushed the depmod_threaded branch from eed529c to 107366d Compare January 22, 2026 00:22

jaredmauch force-pushed the depmod_threaded branch from 7c0797d to f396fd6 Compare January 22, 2026 04:03

jaredmauch requested a review from stoeckmann January 22, 2026 13:49

evelikov mentioned this pull request Feb 13, 2026

Threaded depmod, take 2 #425

Draft

evelikov closed this Apr 20, 2026
