Far too large pull - grid/slurm support, checkpointing, extra features#73
added 30 commits
February 19, 2019 14:49
packages) with the -G and -S options. It does checkpointing for mecat2pw and mecat2cns, allowing restarting of failed jobs. It also allows correction against (in mecat2pw) and of (in mecat2cns) a subset of the given reads with the -R option. The -k option of mecat2cns has been changed to default to zero rather than 10, with the assumption that a quicker partitioning is better (and setting -k to zero is now the same as using a negative value, rather than creating an infinite loop).
Some minor warnings (signed/unsigned comparisons and such) were fixed, and packed_db has been reformatted in preparation to allow larger read sets.
It was just the position of the entry in the array, which is kinda pointless.
The rand() calls for non-AGCT basepairs in packed_db got replaced by a deterministic function to allow identical output from reruns. Some more reformatting while planning the upcoming change allowing for large fasta files in mecat2cns.
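A minimal sketch of the idea behind swapping rand() for a deterministic function, not the code from this pull: ambiguous bases get a 2-bit code derived from their position, so repeated runs pack the same input identically. The names encode_base and encode_sequence, the A=0, C=1, G=2, T=3 ordering, and the position-based rule are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not from the pull): map a base to a 2-bit code.
// Anything other than A/C/G/T gets a code derived from its position, so the
// result depends only on the input and never on rand() state.
inline std::uint8_t encode_base(char c, std::size_t pos) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return static_cast<std::uint8_t>(pos & 3);  // deterministic stand-in
    }
}

// Encode a whole read; calling this twice on the same read gives the same bytes.
inline std::string encode_sequence(const std::string& seq) {
    std::string out(seq.size(), '\0');
    for (std::size_t i = 0; i < seq.size(); ++i) {
        out[i] = static_cast<char>(encode_base(seq[i], i));
    }
    return out;
}
```

Any pure function of the input would serve here; the position was picked only because it is always available while packing.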
commented out methods that weren't used anywhere
in mecat2cns; also moved packed_db into mecat2cns, since that's the only place it's used (the bits that were kinda used in common, lookup_table and split_database, shouldn't have been using it, as they were treating it as subroutines with a fixed interface, not a class)
also removed unneeded asserts from dw
slightly worried about the memory footprint so currently using u4_t to hold the read index which limits total reads to 2^32-1, rather than 2^63-1 for the rest of the program. Easy to change, but will up the memory footprint of the reordering by taking 32 bytes per candidate rather than 16
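To make the trade-off concrete, here is a sketch under an assumed record layout. The Candidate struct and its four fields are hypothetical, chosen only because they reproduce the 16 versus 32 byte figures quoted above; the real structure in the branch may well differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical candidate record; field names and layout are assumptions.
template <typename Index>
struct Candidate {
    Index read_id;
    Index partner_id;
    Index read_offset;
    Index partner_offset;
};

using Candidate32 = Candidate<std::uint32_t>;  // analogous to using u4_t
using Candidate64 = Candidate<std::int64_t>;   // analogous to the rest of the program

static_assert(sizeof(Candidate32) == 16, "four 32-bit fields pack into 16 bytes");
static_assert(sizeof(Candidate64) == 32, "four 64-bit fields pack into 32 bytes");

// Largest read count representable with a given index type.
template <typename Index>
constexpr std::uint64_t max_read_count() {
    return static_cast<std::uint64_t>(std::numeric_limits<Index>::max());
}

// Bytes needed to hold n candidates with a given index type.
template <typename Index>
constexpr std::uint64_t reorder_bytes(std::uint64_t n) {
    return n * sizeof(Candidate<Index>);
}
```

With this layout, a billion candidates would need roughly 16 GB with 32-bit indices and 32 GB with 64-bit ones.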
if there are too many reads to reorder up front, check for it and fall back to the older method (i.e., splitting candidates by read id and reordering inside each partition)
mainly to make reordering optional for now, as it needs more work - it's too memory intensive for something that's supposed to mainly be used when memory is low. Also made sure checks for minimum coverage were always applied regardless of what processing options were chosen.
Changed structures to help lower memory usage, some refactoring to help add read sorting to also help with memory usage
pulling recent changes into branch, since it's never going the other way
Also changed spun off processes to not bother listing options if they're defaults
in the end, it would just take too much memory to hold the read-read pairings, which doesn't work well when the whole point is to limit memory usage
mainly to reduce memory usage (no need to copy the list when I can simply sort it instead)
also in the middle of some memory testing
added 27 commits
April 19, 2019 16:29
mostly - two of the subclasses still have small mallocs
…rings. However, this does appear to have slowed things down a bit, I suspect mainly because of the clearing/recreating of strings, but that can be addressed now that we're off static arrays
also created unified buffer for output instead of left/right buffers
but now getting free errors in the boost routines, for some reason
finally nailed all the bugs (I hope) I created by changing dw.cpp, and the changes should speed up alignment creation as well as reduce memory usage
also testing d_path as a deque rather than a vector
turns out dynamic allocation comes with a large cost - 33% slower, and not appreciably less memory usage. The other changes made a major speed increase, though, 2.5-3x increase.
though it's not currently settable
renamed some variables, made end of band calculations a bit quicker
use actual error_rate, not .25, and correct align size, which should be based directly off the extend size, not k_offset
changed a few vector<char> to vector<uint1> when they just held values from 0-4; changed pthread mutexes to std::mutex, which requires c++11
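For readers unfamiliar with the C++11 replacement for pthread_mutex_t, the pattern looks roughly like the following. ReadDispenser is a hypothetical class invented for this sketch, not something taken from the branch.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical shared counter that hands out read indices to worker threads.
// std::lock_guard releases the mutex automatically when it goes out of scope,
// replacing paired pthread_mutex_lock()/pthread_mutex_unlock() calls.
class ReadDispenser {
public:
    explicit ReadDispenser(std::int64_t total) : next_(0), total_(total) {
        assert(total >= 0);
    }

    // Returns the next unclaimed index, or -1 once every index has been claimed.
    std::int64_t claim() {
        std::lock_guard<std::mutex> guard(lock_);
        return next_ < total_ ? next_++ : -1;
    }

private:
    std::mutex lock_;
    std::int64_t next_;
    std::int64_t total_;
};
```

Each worker thread would call claim() in a loop until it returns -1. Because the guard unlocks on every exit path, an early return or exception cannot leave the mutex held.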
it's not just a right triangle, it's a bounded one
got rid of argument.*, which was no longer used, added and removed various #includes to better reflect what was actually needed, changed Align() to finish out the inner loop (k_min to k_max) when it hit the termination condition and choose the best of the terminating k values rather than the first one
The lto additions might not be portable, though (particularly the change to src/mecat2cns/main.mk, as I had to specify the plugin location for ar)
Got rid of non-standard basic type definitions in mecat2cns, changed all asserts to assert()
vectorized using SSE2 commands and a touch of assembly; more conversion of idx_t to int64_t
improved both the vectorized and non-vectorized string comparison in the inner loop; vectorization relies on sse2 gnu intrinsics and the bsfl/bsrl assembly commands; vectorized version is roughly 10% faster
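As an illustration of the general technique, and not the code in this branch, a forward match-length scan with SSE2 might look like this. The function name match_length is an assumption, and __builtin_ctz (a GCC/Clang builtin that typically compiles to a bit-scan instruction) stands in for the inline bsfl assembly mentioned above. A scalar loop handles the tail and serves as the fallback when SSE2 is unavailable.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Count how many leading bytes of a and b are equal, looking at most n bytes.
inline std::size_t match_length(const char* a, const char* b, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // One bit per byte, set where the two bytes are equal.
        const unsigned mask =
            static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)));
        if (mask != 0xFFFFu) {
            // The lowest clear bit of mask marks the first mismatching byte.
            return i + static_cast<std::size_t>(__builtin_ctz(~mask));
        }
    }
#endif
    // Scalar tail (and the whole scan when SSE2 is not available).
    while (i < n && a[i] == b[i]) ++i;
    return i;
}
```

A backward scan would use the highest bit instead, which is where a bsrl-style instruction comes in.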
some variables renamed to be more expressive, some int64_t changed to int
In brief, this pull provides basic support for grid/slurm (and possibly other remote queueing packages) with the -G and -S options (it supports job queueing and tracking of job completion). It does checkpointing for mecat2pw and mecat2cns, allowing restarting of failed jobs (but only for can runs, not m4). It also allows correction against (in mecat2pw) and of (in mecat2cns) a subset of the given reads with the -R option.
The -k option of mecat2cns has been changed to default to zero rather than 10, with the assumption that a quicker partitioning is better (and setting -k to zero is now the same as using a negative value, rather than creating an infinite loop).
It also changes index_t to idx_t (to avoid a Solaris namespace conflict) and arbitrarily changes the code style, in the code I needed to modify, to one I can read more easily.
There was a small bug fix to findErrors.C as well to prevent crashes from chunks with no matching reads.