depmod: Parallelize dependency computation and output generation #412

Closed
jaredmauch wants to merge 1 commit into kmod-project:master from jaredmauch:depmod_threaded

@jaredmauch
Contributor

Parallelize depmod to leverage multiple CPU cores. The changes include:

  • Parallel module loading: Load symbols and module info concurrently across multiple threads
  • Parallel dependency resolution: Resolve module dependencies concurrently
  • Parallel dependency caching: Pre-compute transitive dependency caches in parallel before output generation
  • Parallel output generation: Generate modules.dep and modules.dep.bin files concurrently

Thread safety is ensured through:

  • Mutex protection for shared data structures (symbols, dependencies, output)
  • Atomic operations for reference counting
  • Memory barriers to ensure visibility of atomic operations
  • Fine-grained locking to minimize contention
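
A minimal sketch of two of the mechanisms listed above (a mutex around shared data and an atomic counter), assuming a C11 toolchain with pthreads. All names here (`shared_stats`, `worker_arg`, `count_symbols`) are hypothetical and are not taken from the patch; the per-iteration locking is deliberately naive to show where contention would come from.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_WORKERS 64

/* Hypothetical stand-in for depmod's shared state; not the patch's types. */
struct shared_stats {
	pthread_mutex_t lock;  /* protects symbols_seen */
	size_t symbols_seen;
	atomic_int active;     /* number of running workers, updated without the lock */
};

struct worker_arg {
	struct shared_stats *stats;
	size_t n_symbols;
};

static void *worker(void *p)
{
	struct worker_arg *arg = p;

	atomic_fetch_add(&arg->stats->active, 1);
	for (size_t i = 0; i < arg->n_symbols; i++) {
		/* Every update takes the same lock, so threads serialize here. */
		pthread_mutex_lock(&arg->stats->lock);
		arg->stats->symbols_seen++;
		pthread_mutex_unlock(&arg->stats->lock);
	}
	atomic_fetch_sub(&arg->stats->active, 1);
	return NULL;
}

/* Runs n_threads workers that each record per_thread symbols; returns the total. */
static size_t count_symbols(int n_threads, size_t per_thread)
{
	struct shared_stats stats = { .symbols_seen = 0 };
	struct worker_arg arg = { &stats, per_thread };
	pthread_t tids[MAX_WORKERS];
	int started = 0;

	assert(n_threads > 0 && n_threads <= MAX_WORKERS);
	pthread_mutex_init(&stats.lock, NULL);
	atomic_init(&stats.active, 0);

	for (int i = 0; i < n_threads; i++) {
		if (pthread_create(&tids[i], NULL, worker, &arg) != 0)
			break;
		started++;
	}
	for (int i = 0; i < started; i++)
		pthread_join(tids[i], NULL);

	/* All workers have been joined, so none is still registered. */
	assert(atomic_load(&stats.active) == 0);
	pthread_mutex_destroy(&stats.lock);
	return stats.symbols_seen;
}
```

Calling `count_symbols(4, 1000)` yields 4000 regardless of scheduling, but every increment goes through one lock, which is the kind of contention discussed later in this thread.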

Example improvements are about 5-6x reduction in wall clock time of depmod in my testing, feedback most welcome!

@stoeckmann
Contributor

Running it multiple times with ASAN enabled reveals race conditions. I get double frees and heap overflows.

Ignoring these, the performance gains are not as specified, which I would have to investigate further. The biggest problem on my test machine is probably the number of CPU cores. The threads will block each other rather often, leading to an unnecessarily high number of context switches.

Can you split this into multiple commits or just use threads for the parsing aspect which can properly scale?

Also, how did you get these high numbers? Have you run multiple times to make sure that you do not measure disk caches in the kernel?

@jaredmauch
Contributor Author

So running back-to-back I see this, and I can update the PR to split things out. This system has 48 CPUs, so while there may be a lot of context-switch blocking, the runtime reduction is quite significant for me, e.g.:

jared@ru1415:~/kmod/builddir$ echo 'Debian trixie time:' ; echo `nproc` CPUs ; dpkg -l kmod ; /sbin/depmod -V ; sudo time /sbin/depmod -a ; git branch ; echo 'threaded branch time';  ./depmod -V ; sudo time ./depmod -a
Debian trixie time:
48 CPUs
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=======================================
ii  kmod           34.2-2       amd64        tools for managing Linux kernel modules
kmod version 34.2
+ZSTD +XZ -ZLIB +OPENSSL
35.43user 0.70system 0:36.15elapsed 99%CPU (0avgtext+0avgdata 195368maxresident)k
0inputs+6784outputs (0major+87389minor)pagefaults 0swaps
* depmod_threaded
  master
threaded branch time
kmod version 34
+ZSTD +XZ +ZLIB +OPENSSL
53.32user 5.28system 0:06.03elapsed 970%CPU (0avgtext+0avgdata 736128maxresident)k
0inputs+6784outputs (0major+214059minor)pagefaults 0swaps

The disk cache is not a factor here, both should be running from cache. My modules are xz compressed in this case, but I can test without that, eg:

jared@ru1415:~/kmod/builddir$ ls /lib/modules/`uname -r`/kernel/*/*.ko* | head -5
/lib/modules/6.19.0-rc6+/kernel/block/bfq.ko.xz
/lib/modules/6.19.0-rc6+/kernel/block/kyber-iosched.ko.xz
/lib/modules/6.19.0-rc6+/kernel/crypto/adiantum.ko.xz
/lib/modules/6.19.0-rc6+/kernel/crypto/aegis128.ko.xz
/lib/modules/6.19.0-rc6+/kernel/crypto/aes_ti.ko.xz

I'll check through for double free/overflow issues w/ ASAN.. I thought I had gotten them all but I guess not :-(

@lucasdemarchi
Contributor

Thanks, I will take a look soon.

> The disk cache is not a factor here, both should be running from cache. My modules are xz compressed in this case, but I can test without that, eg:

I think being xz compressed is a big factor here, as the decompression is slow and you are profiting from decompressing the modules in parallel. I wouldn't expect that big a number for non-compressed or zstd-compressed modules.

@stoeckmann
Contributor

stoeckmann commented Jan 21, 2026

> I think being xz compressed is a big factor here as the decompression is slow and you are profiting from decompressing the modules in parallel. I wouldn't expect that big number for non-compressed or zstd-compressed modules.

With zstd, it's going down from 1.4 s to 1.0 s, with xz from 11 s to 6 s (14 cores on Arch Linux with 6407 modules).

Just tested scaling on my system with xz (I'm not going further than 6 cores, not out of laziness but because the improvement stops there, which I think is due to the hybrid E/P core design of my CPU):

1: 11.0 s
2: 9.0 s
3: 8.4 s
4: 7.6 s
5: 6.6 s
6: 6.1 s
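
To put the scaling numbers above in perspective, here is a small sketch that computes speedup, parallel efficiency, and the Karp-Flatt estimate of the serial fraction from two wall-clock times. The function names are illustrative only, and the serial-fraction figure is a rough model-based estimate, not a measurement of depmod itself.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup of an n-core run relative to the single-core run. */
static double speedup(double t1, double tn)
{
	assert(t1 > 0.0 && tn > 0.0);
	return t1 / tn;
}

/* Parallel efficiency: speedup divided by the number of cores. */
static double efficiency(double t1, double tn, int n)
{
	assert(n >= 1);
	return speedup(t1, tn) / n;
}

/* Karp-Flatt estimate of the serial fraction; only defined for n > 1. */
static double serial_fraction(double t1, double tn, int n)
{
	assert(n > 1);
	return (1.0 / speedup(t1, tn) - 1.0 / n) / (1.0 - 1.0 / n);
}

/* Prints one row per measurement; times[i] is the run with i + 1 cores. */
static void print_scaling(const double *times, int count)
{
	for (int i = 0; i < count; i++)
		printf("%d cores: speedup %.2f, efficiency %.2f\n", i + 1,
		       speedup(times[0], times[i]),
		       efficiency(times[0], times[i], i + 1));
}
```

With the times listed above, `speedup(11.0, 6.1)` comes out at roughly 1.8 on 6 cores, which is far from linear scaling and is consistent with a large part of the run still being serialized.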

@stoeckmann
Contributor

> jared@ru1415:~/kmod/builddir$ echo 'Debian trixie time:' ; echo `nproc` CPUs ; dpkg -l kmod ; /sbin/depmod -V ; sudo time /sbin/depmod -a ; git branch ; echo 'threaded branch time';  ./depmod -V ; sudo time ./depmod -a

The host kmod version and patched kmod master version are very close to each other, but it might also depend on compiler flags in place, e.g. debug vs release build with meson.

Might be interesting to re-run the tests. I have repacked my modules from zstd to xz and noted the vast difference between the two above.

Thus, I think most of the advantage comes from the parallel file parsing, where depmod spends the majority of its time. Having only file parsing in parallel will keep the diff simpler. Looking forward to hearing about the results when only that aspect is parallel. It would also be nice to know how the scaling is on your system. Is there a linear speed improvement with more cores?

Comment thread tools/depmod.c Outdated
@jaredmauch
Contributor Author

I think I got the items, I'll squash/force push in a few - waiting for good ol valgrind to do its thing here..

jared@ru1415:~/kmod/builddir$ sudo valgrind --leak-check=full --track-origins=yes --progress-interval=30 ./depmod -v
==342455== Memcheck, a memory error detector
==342455== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==342455== Using Valgrind-3.24.0 and LibVEX; rerun with -h for copyright info
==342455== Command: ./depmod -v
==342455== 
--342455-- PROGRESS: U 30s, W 31s, 96.8% CPU, EvC 52.93M, TIn 7.2k, TOut 0.0k, #thr 49
--342455-- PROGRESS: U 60s, W 63s, 95.2% CPU, EvC 101.51M, TIn 7.2k, TOut 0.0k, #thr 49
--342455-- PROGRESS: U 90s, W 99s, 90.9% CPU, EvC 129.51M, TIn 7.2k, TOut 0.0k, #thr 49
--342455-- PROGRESS: U 120s, W 133s, 90.2% CPU, EvC 165.03M, TIn 7.3k, TOut 0.0k, #thr 48
--342455-- PROGRESS: U 152s, W 174s, 87.4% CPU, EvC 169.33M, TIn 7.3k, TOut 0.0k, #thr 48
--342455-- PROGRESS: U 180s, W 209s, 86.1% CPU, EvC 171.33M, TIn 7.3k, TOut 0.0k, #thr 48
--342455-- PROGRESS: U 210s, W 246s, 85.4% CPU, EvC 173.53M, TIn 7.3k, TOut 0.0k, #thr 48
--342455-- PROGRESS: U 240s, W 283s, 84.8% CPU, EvC 175.63M, TIn 7.3k, TOut 0.0k, #thr 48

It clocked past an hour earlier before I killed it and went to restart. I'll leave this running and check on it in the AM at the latest (~+14h from now?)

Thanks for the feedback today and the insight this is [likely] all xz related

@stoeckmann
Contributor

stoeckmann commented Jan 21, 2026

I took the current version of the PR (compiles fine now) and replaced all get_cpu_count() calls except the one in depmod_load_modules with fixed constant 1 to avoid threading.
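
For readers who want to repeat this experiment without editing each call site, here is a sketch of a per-phase thread cap. It assumes glibc on Linux, where `sysconf(_SC_NPROCESSORS_ONLN)` is available; `online_cpus()` and `phase_threads()` are hypothetical helpers and do not reflect how the PR's `get_cpu_count()` is actually implemented.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper; the PR's real get_cpu_count() may differ. */
static long online_cpus(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);

	/* sysconf() returns -1 on error; fall back to a single thread. */
	return n > 0 ? n : 1;
}

/*
 * Number of threads to use for one phase, limited by cap.
 * A cap of 1 keeps that phase serial, as in the experiment above.
 */
static long phase_threads(long cap)
{
	long n = online_cpus();

	assert(cap >= 1);
	return n < cap ? n : cap;
}
```

Under this sketch, the loading phase would ask for `online_cpus()` threads while every other phase would call `phase_threads(1)`.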

The result is a tiny speed increase, most likely due to less lock handling in the system. Could you verify this on your system @jaredmauch? I'm especially interested in your high user and system times with the threaded version. They should be lower then.

Most of these activities are very bad for parallel processing as long as everything is eventually put into the same data structure like the output trees.

Also, I'm still seeing memory leaks during these runs at times. Not sure if you have local changes for these as well already.

@jaredmauch
Contributor Author

Yes I do. valgrind is still running as well and I'll review the results in the AM. gonna wander away from the computer in a moment for my own sanity, but I'll force push now so you can poke at them if you're interested.

I'll also maybe just nuke module compression in my test setups as well, this turned into a bit of a side quest of side quests..

(mumbles re: valgrind)

--342455-- PROGRESS: U 5,588s, W 6,684s, 83.6% CPU, EvC 1676.66M, TIn 7.3k, TOut 0.0k, #thr 25
--342455-- PROGRESS: U 5,611s, W 6,711s, 83.6% CPU, EvC 1677.06M, TIn 7.3k, TOut 0.0k, #thr 25

Add pthread-based parallelization to depmod to improve performance on
multi-core systems. The changes parallelize:
- Module symbol loading
- Dependency computation
- Dependency sorting
- Output generation

To ensure deterministic output order for modules.dep files, output lines
are collected from worker threads and sorted alphabetically by module
path before writing.

Add pthread dependency to meson.build and update test expectations to
match the new sorted output format.
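
The deterministic-ordering step described in this commit message can be sketched as follows. This is an illustration under assumptions, not the patch's code: `struct dep_line` and the function names are hypothetical, and "alphabetically" is interpreted here as bytewise `strcmp()` order.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for one modules.dep line produced by a worker thread. */
struct dep_line {
	const char *path; /* module path, used as the sort key */
	const char *deps; /* space-separated dependency paths, may be empty */
};

/* qsort() comparator: bytewise order of the module paths. */
static int dep_line_cmp(const void *a, const void *b)
{
	const struct dep_line *la = a;
	const struct dep_line *lb = b;

	return strcmp(la->path, lb->path);
}

/* Sorts the collected lines so the output no longer depends on thread timing. */
static void sort_dep_lines(struct dep_line *lines, size_t n)
{
	assert(lines != NULL || n == 0);
	if (n > 1)
		qsort(lines, n, sizeof(*lines), dep_line_cmp);
}

/* Writes the sorted lines in "path: deps" form from a single thread. */
static void write_dep_lines(FILE *out, const struct dep_line *lines, size_t n)
{
	for (size_t i = 0; i < n; i++)
		fprintf(out, "%s: %s\n", lines[i].path, lines[i].deps);
}
```

Sorting after all workers have finished keeps the comparison logic single-threaded, at the cost of holding every output line in memory until the end.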
@stoeckmann
Contributor

stoeckmann commented Jan 24, 2026

I took your latest version of the PR and split the module symbol loading part into an own branch, which you can find here: https://github.com/stoeckmann/kmod/tree/depmod_pthread_slim

It reduces the diff without any further adjustments down to 200 lines compared to currently 889. Absolutely no surprise, obviously, since I removed changes.

The more important aspect is that the reduced version avoids any form of locking, because symbol loading threads can fully work in parallel. This makes it MUCH easier to verify correctness.
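
A sketch of why this phase needs no locks: if each thread is handed a disjoint slice of the module array and writes only into its own elements, there is no shared mutable state until the threads are joined. The types and names below are hypothetical and are not taken from the `depmod_pthread_slim` branch; `strlen()` stands in for the real per-module parsing work.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_THREADS 64

/* Hypothetical per-module record; result is written by exactly one thread. */
struct item {
	const char *name;
	size_t result;
};

/* Half-open range [begin, end) of items owned by one thread. */
struct slice {
	struct item *items;
	size_t begin;
	size_t end;
};

static void *load_slice(void *p)
{
	struct slice *s = p;

	/* Stand-in for decompressing and parsing one module file. */
	for (size_t i = s->begin; i < s->end; i++)
		s->items[i].result = strlen(s->items[i].name);
	return NULL;
}

/* Splits items into n_threads disjoint slices; returns 0 or a pthread error. */
static int load_all(struct item *items, size_t n, size_t n_threads)
{
	pthread_t tids[MAX_THREADS];
	struct slice slices[MAX_THREADS];
	size_t started = 0;
	int err = 0;

	assert(n_threads >= 1 && n_threads <= MAX_THREADS);
	for (size_t t = 0; t < n_threads; t++) {
		slices[t] = (struct slice){ items, n * t / n_threads,
					    n * (t + 1) / n_threads };
		err = pthread_create(&tids[t], NULL, load_slice, &slices[t]);
		if (err != 0)
			break;
		started++;
	}
	/* Joining publishes every thread's writes to the caller. */
	for (size_t t = 0; t < started; t++)
		pthread_join(tids[t], NULL);
	return err;
}
```

Anything that must go into a shared structure, such as a symbol table, would then be merged by the main thread after `load_all()` returns, which keeps the merge serial and lock-free as well.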

Also, the reduced version has slightly higher throughput due to the lack of locks, and fewer instructions in total.

I'd recommend that we focus on this aspect first, since these other 600 lines increase binary size, complexity and lead to slower execution times.

Here are my perf stat depmod -o /tmp/output results (zstd compressed, Arch Linux, rather slow dual core). There's not much use in focusing on the real time, since you can see that it fluctuated a bit. The really interesting aspects are the instruction counts and task counts.

master

$ perf stat depmod -o /tmp/output

 Performance counter stats for 'depmod -o /tmp/output':

     7,661,698,622      task-clock:u                     #    0.967 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           239,858      page-faults:u                    #   31.306 K/sec                     
    19,239,683,846      instructions:u                   #    1.31  insn per cycle            
    14,709,275,791      cycles:u                         #    1.920 GHz                       
     3,206,761,967      branches:u                       #  418.545 M/sec                     
        58,520,677      branch-misses:u                  #    1.82% of all branches           

       7.920360567 seconds time elapsed

       6.190921000 seconds user
       1.343797000 seconds sys

depmod_threaded

$ perf stat depmod -o /tmp/output

 Performance counter stats for 'depmod -o /tmp/output':

     8,452,885,314      task-clock:u                     #    1.441 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           239,866      page-faults:u                    #   28.377 K/sec                     
    19,428,974,701      instructions:u                   #    1.29  insn per cycle            
    15,005,888,514      cycles:u                         #    1.775 GHz                       
     3,256,194,805      branches:u                       #  385.217 M/sec                     
        59,641,380      branch-misses:u                  #    1.83% of all branches           

       5.864479618 seconds time elapsed

       6.524242000 seconds user
       1.754721000 seconds sys

depmod_pthread_slim

$ perf stat depmod -o /tmp/output

 Performance counter stats for 'depmod -o /tmp/output':

     7,967,206,929      task-clock:u                     #    1.410 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           239,800      page-faults:u                    #   30.098 K/sec                     
    19,303,882,074      instructions:u                   #    1.31  insn per cycle            
    14,711,762,542      cycles:u                         #    1.847 GHz                       
     3,221,880,958      branches:u                       #  404.393 M/sec                     
        59,201,477      branch-misses:u                  #    1.84% of all branches           

       5.651478036 seconds time elapsed

       6.396976000 seconds user
       1.422937000 seconds sys
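
Since the instruction counts are the point of comparison here, a small sketch for expressing them relative to master may help when reading the three runs above. The helper name is made up for illustration; the inputs are the `instructions:u` values from the perf output.

```c
#include <assert.h>

/*
 * Relative change of a counter versus a baseline, in percent.
 * Positive values mean the other run executed more instructions.
 */
static double overhead_percent(unsigned long long base, unsigned long long other)
{
	assert(base > 0);
	return ((double)other - (double)base) / (double)base * 100.0;
}
```

For the runs above, `overhead_percent(19239683846ULL, 19428974701ULL)` gives just under 1% extra instructions for depmod_threaded, and `overhead_percent(19239683846ULL, 19303882074ULL)` gives roughly a third of a percent for depmod_pthread_slim.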

@evelikov
Collaborator

Thanks for the idea @jaredmauch o/

As Lucas mentioned, decompression is universally the bottleneck. The second one being the signature/crypto handling. Here are some rough/non-scientific numbers:

  • uncompressed - ~25% is signature/crypto, the rest evenly spread throughout
  • xz - ~90% is lzma_code, everything else is in the noise margin
  • zstd - ~55% is ZSTD_decompress, ~7% is signature/crypto, rest is in the noise margin

Quick test of the "slim" version shows promising results, while being much leaner, so I'd be in favour of it.

before (wall/user/sys)   after (wall/user/sys)   compression
0.37/0.29/0.08           0.37/0.48/0.36          none
4.06/3.96/0.09           1.40/5.65/0.27          xz
0.91/0.83/0.08           0.49/1.26/0.35          zstd

Aside: remember how I mentioned signature/crypto just above? It contributes to ~75% of the kmod_module_get_info runtime, yet the data is not used by depmod. So there's potential for another up-to 25% improvement, if you are interested.

@evelikov
Collaborator

In the linked PR, I've polished the slim version a bit and added some more numbers from my end.

As indicated in there - we have somewhat big issues to deal with before either variant can land. HTH o/

@evelikov
Collaborator

> • uncompressed - ~25% is signature/crypto, the rest evenly spread throughout

Managed to get that to 0%, by using mbedtls instead of openssl - see #426. Not sure if that'll land, although if you're interested feel free to see if it makes any measurable difference on your end.

@evelikov
Collaborator

@jaredmauch if you don't mind, I'm going to close this PR.

As mentioned the top offenders are a) decompression and b) crypto/signatures, where each one is (at least partially) tackled by the linked PRs.

@evelikov closed this Apr 20, 2026