depmod: Parallelize dependency computation and output generation #412
jaredmauch wants to merge 1 commit
|
Running it multiple times with ASAN enabled reveals race conditions: I get double frees and heap overflows. Ignoring these, the performance gains are not as specified, which I would have to investigate further. The biggest problem on my test machine is probably the number of CPU cores. The threads will block each other on locks rather often, leading to an unnecessarily high number of context switches. Can you split this into multiple commits, or just use threads for the parsing aspect, which can scale properly? Also, how did you get these high numbers? Have you run it multiple times to make sure that you are not measuring the kernel's disk caches? |
|
So running back-to-back I see this, and I can update the PR to split things out. This system has 48 CPUs, so while there may be a lot of context-switch (CS) blocking, the runtime reduction is quite significant for me, e.g.: The disk cache is not a factor here; both should be running from cache. My modules are xz-compressed in this case, but I can test without that, e.g.: I'll check through for double-free/overflow issues with ASAN. I thought I had gotten them all, but I guess not :-( |
|
Thanks, I will take a look soon.
I think being xz-compressed is a big factor here, as the decompression is slow and you are profiting from decompressing the modules in parallel. I wouldn't expect such a big number for non-compressed or zstd-compressed modules. |
With zstd, it goes down from 1.4 s to 1.0 s, with xz from 11 s to 6 s (14 cores on Arch Linux with 6407 modules). Just tested scaling on my system (I'm not going further than 6 cores, not out of laziness but because the improvement stops, which I think is due to the hybrid E/P core design of my CPU): 1: 11.0 s |
The host kmod version and the patched kmod master version are very close to each other, but it might also depend on the compiler flags in place, e.g. debug vs. release build with meson. It might be interesting to re-run the tests. I have repacked my modules from zstd to xz and mentioned the vast difference between zstd and xz above. Thus, I think most of the advantage is the parallel file parsing, where depmod spends the majority of its time. Having only file parsing in parallel will keep the diff simpler. Looking forward to hearing about the results when only that aspect is parallel. It would also be nice to know how the scaling is on your system. Is there a linear speed improvement with more cores? |
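One rough way to read the "1.4 s to 1.0 s" and "11 s to 6 s" figures is Amdahl's law. This is only a sketch: the thread does not say which core count produced the faster timings, so $n = 6$ below is an assumption, and $p$ (the parallelizable fraction of the run) is a model parameter, not something measured in this PR.

$$
S(n) = \frac{1}{(1 - p) + p/n}
\quad\Longrightarrow\quad
p = \frac{1 - 1/S}{1 - 1/n}
$$

$$
\text{xz: } S = \frac{11}{6} \approx 1.83,\qquad
p \approx \frac{1 - 0.545}{1 - 0.167} \approx 0.55
$$

$$
\text{zstd: } S = \frac{1.4}{1.0} = 1.4,\qquad
p \approx \frac{1 - 0.714}{1 - 0.167} \approx 0.34
$$

Under those assumptions, a larger share of the xz run behaves as parallelizable work than of the zstd run, which would fit the suggestion that slow decompression is where the threads help most.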
|
I think I got the items; I'll squash/force push in a few. Waiting for good ol' valgrind to do its thing here. It clocked past an hour earlier when I killed it and went to restart. I'll leave this running and check on it in the AM at worst (~+14h from now?). Thanks for the feedback today and the insight that this is [likely] all xz related. |
|
I took the current version of the PR (compiles fine now) and replaced all The result is a tiny speed increase, most likely due to less lock handling in the system. Could you verify this on your system @jaredmauch? I'm especially interested in your high user and system times with the threading version. They should be lower then. Most of these activities are very bad for parallel processing as long as everything is eventually put into the same data structure, like the output trees. Also, I'm still seeing memory leaks during these runs at times. Not sure if you already have local changes for these as well. |
|
Yes I do. valgrind is still running as well and I'll review the results in the AM. gonna wander away from the computer in a moment for my own sanity, but I'll force push now so you can poke at them if you're interested. I'll also maybe just nuke module compression in my test setups as well, this turned into a bit of a side quest of side quests.. (mumbles re: valgrind) |
eed529c to 107366d
Add pthread-based parallelization to depmod to improve performance on multi-core systems. The changes parallelize:

- Module symbol loading
- Dependency computation
- Dependency sorting
- Output generation

To ensure deterministic output order for modules.dep files, output lines are collected from worker threads and sorted alphabetically by module path before writing.

Add pthread dependency to meson.build and update test expectations to match the new sorted output format.
7c0797d to f396fd6
|
I took your latest version of the PR and split the module symbol loading part into its own branch, which you can find here: https://github.com/stoeckmann/kmod/tree/depmod_pthread_slim It reduces the diff, without any further adjustments, to 200 lines compared to currently 889. Absolutely no surprise, obviously, since I removed changes. The more important aspect is that the reduced version avoids any form of locking, because the symbol loading threads can work fully in parallel. This makes it MUCH easier to verify correctness. Also, the reduced version has slightly higher throughput due to the lack of locks, and fewer instructions in total. I'd recommend that we focus on this aspect first, since these other 600 lines increase binary size and complexity and lead to slower execution times. Here are my results (columns: master, depmod_threaded, depmod_pthread_slim): |
|
Thanks for the idea @jaredmauch o/ As Lucas mentioned, decompression is universally the bottleneck. The second one being the signature/crypto handling. Here are some rough/non-scientific numbers:
Quick test of the "slim" version shows promising results, while being much leaner, so I'd be in favour of it.
Aside: remember how I mentioned signature/crypto just above? It contributes to ~75% of the |
|
In the linked PR, I've polished the slim version a bit and added some more numbers from my end. As indicated in there - we have somewhat big issues to deal with before either variant can land. HTH o/ |
Managed to get that to 0%, by using mbedtls instead of openssl - see #426. Not sure if that'll land, although if you're interested feel free to see if it makes any measurable difference on your end. |
|
@jaredmauch if you don't mind, I'm going to close this PR. As mentioned the top offenders are a) decompression and b) crypto/signatures, where each one is (at least partially) tackled by the linked PRs. |
Parallelize depmod to leverage multiple CPU cores. The changes include:
Thread safety is ensured through:
Example improvements are about 5-6x reduction in wall clock time of depmod in my testing, feedback most welcome!