Threaded depmod, take 2#425
Conversation
Currently, in the (unlikely) case of array_append failing, we'll return success and thus partial output. Fixes: 823849a ("tools/depmod: use separate arrays for alias,xxxdep values") Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
With a later commit, we'll need to create symbols outside of the depmod_symbol_add function, so split the function in two. Aside: depmod_symbol_add should really get some error checking, that we're not helping with here... Also we should consider using hash_add_unique() since currently a duplicate symbol will override the previous one. But all of that can happen at a later date. Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
Split out the module symbol resolution - effectively, the initial decompression of the module and elf parsing - from adding the data into the depmod hashmap. This will come in handy with the next commit, which will delegate the symbol resolution (and thus the slowest part, decompression) to threads. Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
Add pthread-based parallelization to depmod to improve performance on multi-core systems. The changes parallelize: - Module symbol loading Co-authored-by: jared mauch <jared@puck.nether.net> Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
c33b7b1 to
f65443f
Compare
|
|
||
| for (unsigned int i = 0; i < n_threads; i++) { | ||
| struct module_symbols *mod_syms = &tinfo[i].mod_syms; | ||
|
|
Check warning
Code scanning / CodeQL
Local variable address stored in non-local memory Warning
|
Seems like the CI is failing, even if it's working fine locally. Will check at some point tomorrow/day after. In the meanwhile, do people have a preference about:
|
|
Kicking off helgrind reminded me that we mutate We can resolve, at least some of, the problem by deferring the hash-entry removal(s) aka So as a whole, we could look on making hash implementation lock-free... And as a small aside, to convert the refcounting to use atomics. To make things even more fun, helgrind is also unhappy with OpenSSL - haven't looked why/how to be honest. It can be side-stepped by a) having a more granular Since I don't know if I'll be looking at it, I'm going to park this as draft. |
|
If OpenSSL/helgrind proves to be a blocking factor, it can potentially be side-stepped by #426. |
| struct thread_info { | ||
| pthread_t tid; | ||
| struct module_symbols mod_syms; | ||
| } *tinfo; |
There was a problem hiding this comment.
I was thinking about making mod_syms a thread_local, but... we we will already need to allocate the array for keeping the tid
| if (local_err != 0) | ||
| err = -local_err; | ||
| if ((int)(intptr_t)res != 0) | ||
| err = (int)(intptr_t)res; |
There was a problem hiding this comment.
should we worry about the other threads continuing even if one of them already falied? I guess we can keep it simple, otherwise I think we'd need to add a cond var just for this.
| static int depmod_symbol_add(struct depmod *depmod, struct symbol *sym) | ||
| { | ||
| int err = hash_add(depmod->symbols, sym->name, sym); | ||
| if (err < 0) |
There was a problem hiding this comment.
should we keep the free(sym) here to avoid leaking it?
|
|
||
| modules_per_thread = (depmod->modules.count + n_threads - 1) / n_threads; | ||
| last_modules_per_thread = | ||
| modules_per_thread - (depmod->modules.count % n_threads); |
There was a problem hiding this comment.
11 modules, 4 threads:
modules_per_thread == (11 + 3) / 4 == 3
last_modules_per_thread = 3 - (11 % 4) = 0
I think you wanted:
last_modules_per_thread = depmod->modules.count - n_threads * modules_per_thread;
so the last thread has potentially fewer modules to calculate:
tinfo[0].mod_syms.modules_count = 3
tinfo[1].mod_syms.modules_count = 3
tinfo[2].mod_syms.modules_count = 3
tinfo[3].mod_syms.modules_count = 2
| mod_syms->depmod = depmod; | ||
| mod_syms->modules = | ||
| (struct mod **)depmod->modules.array + (i * modules_per_thread); | ||
| mod_syms->modules_count = modules_per_thread; |
There was a problem hiding this comment.
humn... assuming the threads are working on different modules, and thus have access to different kmod_mod, I think it should be fine to call into kmod from a concurrency perspective. One place I'm not so sure is when freeing the module - because that puts it back in the pool and there may be a synchronization issue with another thread creating the module.
The other issue would be logging, but that I think we could ignore.
Maybe it would be better from a correctness perspective to clone the kmod_ctx per thread?
sorry, just saw this comment after reviewing... you had already caught that issue. But this solution still seems fragile - we are breaking the rule of not calling into libkmod from multiple threads. I think just cloning the ctx would be better. This is what e.g. udev does to handle multithread. |
This is a retake of #412, the slimmed down version in particular.
The changes:
split out report error on array_append failure into separate patch
reuse struct symbol, instead or rolling our own
split/reuse symbol creation - 2 times fewer allocations, honour sym_prefix
don't allocate the struct symbol(s) twice
consolidate cleanup into a helper function
remove unused members in load_modules_work, rename
swap modules, start and end for modules(_start) and count
store thread id and "work" struct in single place - nproc fewerer calloc
properly check for errors on pthread_create/join
In summary, the numbers are as follows:
Which are mostly identical to previously, with a notable exception for uncompressed modules. Unsurprisingly adding extra threads damages performance.
Some nproc numbers from an aging 4 core + HT laptop.
Some `perf stat -r 100` numbers
Even with the 100 repeat, the numbers are not that stable/reliable.
Valgrind memory stats
Summary, relative to master
The 27k -> 160 changes are expected since we reuse
struct symbolsalthough why we get an extra 5MB memory for nproc is beyond me 🤷