Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
118 commits
Select commit Hold shift + click to select a range
37848a6
some progress on hub label integration?
electricEpilith Nov 12, 2025
297a11f
hub labeling in (debugging not finished), also changes to deal with C…
electricEpilith Jan 23, 2026
9d4c2e2
Point at compatible libbdsg and get build working on Mac
adamnovak Jan 23, 2026
8c13cf3
Use the new indexing types and accessors to avoid fetching nodes by t…
adamnovak Jan 23, 2026
788224d
Use accessors so we can build the Tiny oversized snarl test index and…
adamnovak Jan 23, 2026
4985468
Try dumping hub label data for debugging
adamnovak Jan 26, 2026
86e4e31
Add synthetic Boost graph dumping code, and missing semicolon, and li…
adamnovak Jan 27, 2026
ee5bd54
Merge remote-tracking branch 'origin/master' into hublabel
adamnovak Jan 28, 2026
e84e657
Use libbdsg with slightly more implemented hub labeling integration
adamnovak Jan 28, 2026
4f31496
Make sure NodeProp fields are not used before initialization
adamnovak Jan 29, 2026
232a589
Stop trying to look up removed trivial snarls
adamnovak Jan 29, 2026
30e392a
Add the debugging to subgraph finding that I needed to fix ChainRecor…
adamnovak Jan 29, 2026
9639b68
Stop trying to interpret the root as a chain in debug prints
adamnovak Feb 2, 2026
c0db406
Turn off debugging after passing existing snarl distance index tests
adamnovak Feb 2, 2026
ce1027f
Merge remote-tracking branch 'origin/master' into hublabel
adamnovak Feb 10, 2026
77c2ec2
Merge remote-tracking branch 'origin/hublabel' into hublabel
adamnovak Feb 10, 2026
163764f
Make randomized graph test actually exercise oversized snarls sometimes
adamnovak Feb 10, 2026
ddce5f4
Add function for loading a handlegraph from JSON
adamnovak Feb 10, 2026
e56353f
Allow cactus-ifying all handle graphs
adamnovak Feb 10, 2026
4f66c25
Add synthetic fix for actually populating the unique_ptr right
adamnovak Feb 10, 2026
5436d73
Commit partial synthetic refactor to use new JSON load method
adamnovak Feb 10, 2026
2c3721d
Revert "Commit partial synthetic refactor to use new JSON load method"
adamnovak Feb 10, 2026
695cff5
Replace string_to_graph with json2graph
adamnovak Feb 10, 2026
8ede8df
Remove a bunch of mostly unused functions for working with Protobuf G…
adamnovak Feb 10, 2026
e472799
Mostly-automatically convert tests to use vg::io::json2graph
adamnovak Feb 10, 2026
904f445
Remove duplicative JSON to graph function
adamnovak Feb 10, 2026
809a766
Set up tiny test that breaks oversized snarl logic
adamnovak Feb 10, 2026
f2d4f08
Remove unused cases
adamnovak Feb 10, 2026
caaa512
Fill in the dustances through oversized snarls to pass more distance …
adamnovak Feb 11, 2026
f15a5f9
Add exhaustive test for small snarls
adamnovak Feb 11, 2026
a0c71e6
Add a test for one of the failing possible small oversized snarls spe…
adamnovak Feb 11, 2026
8c048e2
Pin down one small graph
adamnovak Feb 11, 2026
e036967
Add more test cases from sequential and random graphs and make them pass
adamnovak Feb 13, 2026
8860e3a
Turn off debugging after passing random graph test
adamnovak Feb 13, 2026
eeaa175
Add initial synthetic code for SnarlDecompositionFuzzer and randomly_…
adamnovak Feb 13, 2026
c2f3279
Synthesize code that puts child chains the right way around
adamnovak Feb 13, 2026
c3cd3b7
Synthesize much simpler code that I designed myself because stochasti…
adamnovak Feb 13, 2026
c75e7ed
Synthesize slightly more encapsulated code
adamnovak Feb 13, 2026
a9ba894
Simplify the cursor loop and the flipping determination
adamnovak Feb 13, 2026
836e7ee
Hook up orientation fuzzers to random graph tests and fail to find mo…
adamnovak Feb 13, 2026
22ff0c4
Dump mostly-synthetic hot tips for cool robots
adamnovak Feb 14, 2026
82678d4
added recombination info in output gam and log
dcmonti Mar 16, 2026
39cc3a5
minimap2 like rescore, progressive recombination penalty
dcmonti Mar 17, 2026
e92a036
Implement populating is_regular and the single strict notion of regul…
adamnovak Mar 20, 2026
e4e1dae
Add debugging and reduce Saturn levels by not consuming all_children
adamnovak Mar 20, 2026
934e53e
Turn off debugging and don't count bounds as children for regularity
adamnovak Mar 20, 2026
21d2bb6
Set looping "distances" in distanceless index so we can tell snarls a…
adamnovak Mar 20, 2026
a203cb6
Use libbdsg that tries not to make way too many MPHF threads
adamnovak Mar 20, 2026
07d0472
Add another test to make sure we aren't missing reversals hiding in t…
adamnovak Mar 20, 2026
9aa832a
don't build tests for sparsehash due to C++20 incompatibility
electricEpilith Mar 21, 2026
85b2f44
don't build tests for sparsehash due to C++20 incompatibility
electricEpilith Mar 21, 2026
804df61
bugfixes for rescoring
dcmonti Mar 23, 2026
180dbf0
Parallelize cache_payloads and re-preload distance index before it
electricEpilith Mar 24, 2026
b8d310d
Atomic-ize progress and remove uninformative comment text
adamnovak Mar 24, 2026
0281b81
Replace snarl tree depth limit with fixed point check
adamnovak Mar 24, 2026
4422b08
Preload distance index only once
adamnovak Mar 24, 2026
aa2d827
Remove extra argument
adamnovak Mar 24, 2026
96b57ee
Use libbdsg that should define child snarl count function
adamnovak Mar 24, 2026
3c27df0
Regular-ify simple snarls
adamnovak Mar 24, 2026
9b48cfe
Merge pull request #4860 from vgteam/parallel-payload-caching
adamnovak Mar 26, 2026
957ebeb
Merge pull request #4857 from vgteam/hublabel-debug
adamnovak Mar 26, 2026
e6343e0
merge newer commits into hublabel
electricEpilith Mar 30, 2026
644b900
move libbdsg up
electricEpilith Mar 30, 2026
44321a3
added back (more) preloading to speed minimizer back up
electricEpilith Mar 30, 2026
0c5c0d9
update oldest-supported-compiler-job, upgrade gcc requirement to 10 t…
electricEpilith Mar 31, 2026
992c1cc
edit correct place for GCC version notice
electricEpilith Mar 31, 2026
7569cc1
move up libbdsg to upgrade snarl distance index version number
electricEpilith Apr 2, 2026
b81e331
add (a substantial amount of) instrumentation for vg giraffe
electricEpilith Apr 3, 2026
528ec4e
fix abs() errors on Mac
electricEpilith Apr 3, 2026
a3760d1
additional abs() fix
electricEpilith Apr 4, 2026
7920c76
minor print changes
electricEpilith Apr 4, 2026
190469a
Revert "minor print changes"
electricEpilith Apr 4, 2026
3e573bd
Revert "add (a substantial amount of) instrumentation for vg giraffe"
electricEpilith Apr 4, 2026
6cf266c
second try at instrumentation
electricEpilith Apr 4, 2026
e5513aa
Revert "second try at instrumentation"
electricEpilith Apr 6, 2026
bea2589
Merge pull request #4868 from electricEpilith/hublabel
electricEpilith Apr 6, 2026
5145cd4
debug for supported paths across chain
dcmonti Apr 6, 2026
e9c3d40
snarl distance index version number update
electricEpilith Apr 7, 2026
e286bf5
Fix typo in src/snarl_distance_index.cpp
electricEpilith Apr 8, 2026
49482ca
Merge commit '508afe07817e1ae7daba5106a91696a5c6e24588' into score_se…
dcmonti Apr 17, 2026
90fd844
Merge commit 'a2eb9ea36c1edf889dae08c0493fe1b2ab4b57b5' into score_se…
adamnovak Apr 20, 2026
b7428d5
stub in missing transition checker
faithokamoto Apr 22, 2026
283abfc
Merge remote-tracking branch 'origin/master' into hublabel
adamnovak Apr 22, 2026
a8fc246
Refactor snarl decomposition capturing and flipping to make tests eas…
adamnovak Apr 22, 2026
6c32359
Move CHOverlay output to libbdsg
adamnovak Apr 22, 2026
63f88d5
Apply my own review suggestions
adamnovak Apr 22, 2026
e12c989
Use libbdsg with some review comments addressed
adamnovak Apr 22, 2026
2493339
Merge remote-tracking branch 'origin/hublabel' into hublabel
adamnovak Apr 22, 2026
02b4506
fill in distance checker
faithokamoto Apr 22, 2026
ceb928d
Merge branch 'master' into missing-trans
faithokamoto Apr 22, 2026
9b56af1
Fix potentially very large loop when looking for rescue graph
Apr 23, 2026
1f1cfda
Merge branch 'master' into missing-trans
faithokamoto Apr 23, 2026
c45bd23
Add an error when minimizers/zipcodes seem older than the distance index
adamnovak Apr 23, 2026
f1cb7e4
Actually compare times and not iterators
adamnovak Apr 23, 2026
0d89ce6
Merge remote-tracking branch 'origin/master' into score_seeds_fix_rec…
adamnovak Apr 23, 2026
78ad8bb
Declare the minimap2-based scores to be right
adamnovak Apr 23, 2026
31db629
Clear explanations before making them to count
adamnovak Apr 23, 2026
3cfc00e
remove unnecessary return info
faithokamoto Apr 23, 2026
d71dc9f
Move minimap2-style scoring logic into functions we could use later, …
adamnovak Apr 23, 2026
0dd4050
Stop protoc from trying to run multiple times at once and destroying …
adamnovak Apr 24, 2026
8a79278
Make sure the distance index is on disk with any pointer fixups befor…
adamnovak Apr 24, 2026
07248ad
unittests to detect possible future regressions
electricEpilith Apr 25, 2026
128b4b5
split snarl_distance_index.cpp into four files
electricEpilith Apr 25, 2026
c33c076
refactor populate_snarl_index to be less complex
electricEpilith Apr 26, 2026
43c97c8
reduce duplication with new SnarlChildGraph class
electricEpilith Apr 26, 2026
ef0a6c5
break apart populate_distance_matrix_row
electricEpilith Apr 26, 2026
809463f
Merge pull request #4885 from vgteam/rescue-bug
xchang1 Apr 27, 2026
9e40f73
Merge remote-tracking branch 'origin/master' into score_seeds_fix_rec…
adamnovak Apr 27, 2026
11e053e
Stop worrying about negative recombination penalties
adamnovak Apr 27, 2026
f665d80
Make the evaluation bonus system clearer
adamnovak Apr 27, 2026
76791ab
Adjust comments
adamnovak Apr 27, 2026
e6d032a
Rename variables
adamnovak Apr 27, 2026
0655352
Merge pull request #4887 from vgteam/score_seeds_fix_recombination
adamnovak Apr 28, 2026
53ab106
Merge pull request #4889 from vgteam/simplify-return
adamnovak Apr 28, 2026
27108c4
Merge pull request #4886 from vgteam/index-age-error
adamnovak Apr 29, 2026
cf99eb7
Merge pull request #4884 from vgteam/missing-trans
faithokamoto Apr 30, 2026
4bdcafb
merge in changes from master
electricEpilith May 1, 2026
94c094f
remove hash from serialization test because of cross-platform compati…
electricEpilith May 1, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .gitlab-ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -158,7 +158,7 @@ oldest-supported-compiler-job:
GIT_SUBMODULE_STRATEGY: none
# DO NOT change this version number without updating the README to reflect
# the requirement bump.
COMPILER_VERSION: 9
COMPILER_VERSION: 10


# We define one job to do the Docker container build
Expand Down
75 changes: 75 additions & 0 deletions BOTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
# VG Project Notes

## Building

Check notice on line 3 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L3

Expected: 1; Actual: 0; Below
- New `.cpp` files auto-discovered

Check notice on line 4 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L4

Lists should be surrounded by blank lines
- Build with `make -j8` or `make obj/whatever.o` to build just one .o.
- You may be getting errors from `clangd`. If these errors seem spurious, stop and demand a `clangd` that works properly.

## Testing

### Running Bash-TAP Tests

Check notice on line 10 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L10

Expected: 1; Actual: 0; Below
Use `prove -v` (not `bash`) to execute Bash-TAP tests. This provides proper test harness output and better error reporting.

**Important**: Run `prove` from the `test/` directory:
```bash
cd test
prove -v t/26_deconstruct.t
```

### Running Unit Tests

Check notice on line 19 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L19

Expected: 1; Actual: 0; Below
To run all unit tests:
```bash
./bin/vg test
```
- `./bin/vg test "[tag]"` runs tests matching a tag

Check notice on line 24 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L24

Lists should be surrounded by blank lines

#### Writing Unit Tests

Check notice on line 26 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L26

Expected: 1; Actual: 0; Below
- Framework: Catch v2 (header-only)

Check notice on line 27 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L27

Lists should be surrounded by blank lines
- Include: `#include "catch.hpp"` (in `src/unittest/catch.hpp`)
- Macros: `TEST_CASE("name", "[tags]")`, `SECTION("name")`, `REQUIRE(cond)`
- Namespace: `vg::unittest`
- Directory: `src/unittest/`

### Running All Tests

Check notice on line 33 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L33

Expected: 1; Actual: 0; Below
```bash
make test
```

## Writing Code

### HandleGraph API

Check notice on line 40 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L40

Expected: 1; Actual: 0; Below
The interfaces in libhandlegraph model a bidirected sequence graph (where nodes have DNA sequences and edges can connect to either the start or end of each involved node).

#### Core types

Check notice on line 43 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L43

Expected: 1; Actual: 0; Below
- `handle_t` - opaque 64-bit value

Check notice on line 44 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L44

Lists should be surrounded by blank lines
- `nid_t` - node ID type
- `edge_t` = `pair<handle_t, handle_t>`

#### Key HandleGraph methods

Check notice on line 48 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L48

Expected: 1; Actual: 0; Below
- `get_handle(nid_t, bool is_reverse=false)` → `handle_t`

Check notice on line 49 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L49

Lists should be surrounded by blank lines
- `get_id(handle_t)` → `nid_t`
- `get_is_reverse(handle_t)` → `bool`
- `flip(handle_t)` → `handle_t` (toggle orientation)
- `get_sequence(handle_t)` → `string` (in handle's orientation)
- `follow_edges(handle_t, bool go_left, iteratee)` - iterate neighbors
- `for_each_handle(iteratee, bool parallel=false)` - iterate all nodes
- `for_each_edge(iteratee, bool parallel=false)` - iterate all edges
- `has_edge(handle_t left, handle_t right)` → `bool`

#### MutableHandleGraph additions

Check notice on line 59 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L59

Expected: 1; Actual: 0; Below
- `create_handle(string seq)` / `create_handle(string seq, nid_t id)` → `handle_t`

Check notice on line 60 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L60

Lists should be surrounded by blank lines
- `create_edge(handle_t left, handle_t right)`
- `destroy_handle(handle_t)` / `destroy_edge(handle_t, handle_t)`

#### HandleGraph algorithms

Check notice on line 64 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L64

Expected: 1; Actual: 0; Below
- Things like `topological_sort.hpp` and copy_graph.hpp` are in `deps/libhandlegraph/src/include/handlegraph/algorithms`.

Check notice on line 65 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L65

Lists should be surrounded by blank lines

#### bdsg::HashGraph

Check notice on line 67 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L67

Expected: 1; Actual: 0; Below
- Header: `deps/libbdsg/bdsg/include/bdsg/hash_graph.hpp`

Check notice on line 68 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L68

Lists should be surrounded by blank lines
- Implements MutablePathMutableHandleGraph
- Go-to handlegraph implementation to use
- In libbdsg

### Utilities

Check notice on line 73 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L73

Expected: 1; Actual: 0; Below
- `reverse_complement(string)` → `string` in src/utility.hpp

Check notice on line 74 in BOTS.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

BOTS.md#L74

Lists should be surrounded by blank lines

1 change: 1 addition & 0 deletions CLAUDE.md
10 changes: 6 additions & 4 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -104,7 +104,8 @@ ifeq ($(shell uname -s),Darwin)
LD_UTIL_RPATH_FLAGS=""

# Homebrew installs a Protobuf that uses an Abseil that is built with C++17, so we need to build with at least C++17
CXX_STANDARD?=17
# C++20 for spaceship operator and ranges
CXX_STANDARD?=20

# We may need libraries from Macports
ifeq ($(shell if [ -d /opt/local/lib ];then echo 1;else echo 0;fi), 1)
Expand Down Expand Up @@ -229,8 +230,9 @@ else
$(info Compiler $(CXX) is assumed to be GCC)

# gbwtgraph uses inline variables and our oldest supported compiler has
# C++17, so we should use C++17
CXX_STANDARD?=17
# C++17, so we should use at least C++17.
# C++20 for spaceship operator and ranges
CXX_STANDARD?=20

# Set an rpath for vg and dependency utils to find installed libraries
LD_UTIL_RPATH_FLAGS="-Wl,-rpath,$(CWD)/$(LIB_DIR)"
Expand Down Expand Up @@ -820,7 +822,7 @@ $(INC_DIR)/dynamic/dynamic.hpp: $(DYNAMIC_DIR)/include/dynamic/*.hpp $(DYNAMIC_D
+mkdir -p $(INC_DIR)/dynamic && cp -r $(CWD)/$(DYNAMIC_DIR)/include/dynamic/* $(INC_DIR)/dynamic/

$(INC_DIR)/sparsehash/sparse_hash_map: $(wildcard $(SPARSEHASH_DIR)/**/*.cc) $(wildcard $(SPARSEHASH_DIR)/**/*.h)
+cd $(SPARSEHASH_DIR) && ./autogen.sh && LDFLAGS="$(LD_LIB_DIR_FLAGS) $(LDFLAGS)" ./configure --prefix=$(CWD) $(FILTER) && $(MAKE) $(FILTER) && $(MAKE) install
+cd $(SPARSEHASH_DIR) && ./autogen.sh && LDFLAGS="$(LD_LIB_DIR_FLAGS) $(LDFLAGS)" ./configure --prefix=$(CWD) $(FILTER) && $(MAKE) src/sparsehash/internal/sparseconfig.h $(FILTER) && $(MAKE) install-data $(FILTER)

$(INC_DIR)/sparsepp/spp.h: $(wildcard $(SPARSEPP_DIR)/sparsepp/*.h)
+cp -r $(SPARSEPP_DIR)/sparsepp $(INC_DIR)/
Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -93,7 +93,7 @@ On other distros, or if you do not have root access, you will need to perform th
liblzma-dev liblz4-dev libffi-dev libcairo-dev libboost-all-dev \
libzstd-dev pybind11-dev python3-pybind11 libssl-dev kmc

At present, you will need GCC version 9 or greater, with support for C++17, to compile vg. (Check your version with `gcc --version`.) GCC up to 11.4.0 is supported.
At present, you will need GCC version 10 or greater, with support for C++20, to compile vg. (Check your version with `gcc --version`.) GCC up to 11.4.0 is supported.

Other libraries may be required. Please report any build difficulties.

Expand Down
2 changes: 1 addition & 1 deletion deps/libvgio
Submodule libvgio updated 1 files
+16 −1 CMakeLists.txt
172 changes: 168 additions & 4 deletions src/algorithms/chain_items.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@

//#define debug_chaining
//#define debug_transition
//#define debug_missing_transition
//#define debug_dp

namespace vg {
Expand Down Expand Up @@ -262,6 +263,18 @@
generate_zip_tree_transitions(seeds, zip_code_tree, max_graph_lookback_bases,
seed_to_starting, seed_to_ending);

#ifdef debug_missing_transition
bool has_missing = \
find_missing_zip_tree_transitions(seeds, zip_code_tree, max_graph_lookback_bases,
seed_to_starting, seed_to_ending, distance_index,
all_transitions);
if (has_missing) {
throw std::runtime_error("Zipcode tree iterator failed to output some transitions");
} else {
cerr << "No missing transitions" << endl;
}
#endif

std::vector<transition_info> filtered_transitions =
calculate_transition_read_distances(all_transitions, to_chain, max_read_lookback_bases);

Expand Down Expand Up @@ -385,6 +398,103 @@
return all_transitions;
}

bool find_missing_zip_tree_transitions(

Check warning on line 401 in src/algorithms/chain_items.cpp

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

src/algorithms/chain_items.cpp#L401

Method vg::algorithms::find_missing_zip_tree_transitions has 75 lines of code (limit is 50)

Check warning on line 401 in src/algorithms/chain_items.cpp

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

src/algorithms/chain_items.cpp#L401

Method vg::algorithms::find_missing_zip_tree_transitions has a cyclomatic complexity of 23 (limit is 8)
const std::vector<SnarlDistanceIndexClusterer::Seed>& seeds,
const ZipCodeTree& zip_code_tree,
size_t max_graph_lookback_bases,
const std::unordered_map<size_t, size_t>& seed_to_starting,
const std::unordered_map<size_t, size_t>& seed_to_ending,
const SnarlDistanceIndex& distance_index,
const std::vector<transition_info>& all_transitions) {

// {source anchor : {dest anchor : dist}}
std::unordered_map<size_t, std::unordered_map<size_t, size_t>> found;
for (const auto& transition : all_transitions) {
size_t dist_to_save = transition.graph_distance;
if (!found.count(transition.from_anchor)) {
found[transition.from_anchor] = std::unordered_map<size_t, size_t>();
}
if (found[transition.from_anchor].count(transition.to_anchor)) {
// If a transition appears multiple times, remember the min
dist_to_save = std::min(transition.graph_distance,
found[transition.from_anchor][transition.to_anchor]);
}
found[transition.from_anchor][transition.to_anchor] = transition.graph_distance;
}

bool has_missing = false;

// Helper function to check for a distance between two seeds
auto check_distance = [&] (const ZipCodeTree::oriented_seed_t& from_seed, bool rev_from,
const ZipCodeTree::oriented_seed_t& to_seed, bool rev_to) {
// XOR to get appropriate orientations
rev_from ^= from_seed.is_reversed;
rev_to ^= to_seed.is_reversed;
if (rev_from != rev_to) {
// Cannot be compared; incompatible orientations
return;
}

// Look up appropriate anchors
auto from_anchor_itr = rev_from ? seed_to_starting.find(from_seed.seed)
: seed_to_ending.find(from_seed.seed);
if ((rev_from && from_anchor_itr == seed_to_starting.end())
|| (!rev_from && from_anchor_itr == seed_to_ending.end())) {
// No anchor exists
return;
}
auto to_anchor_itr = rev_to ? seed_to_ending.find(to_seed.seed)
: seed_to_starting.find(to_seed.seed);
if ((rev_to && to_anchor_itr == seed_to_ending.end())
|| (!rev_to && to_anchor_itr == seed_to_starting.end())) {
// No anchor exists
return;
}

// Construct seed positions
pos_t from_pos = seeds.at(from_seed.seed).pos;
size_t from_length = distance_index.minimum_length(distance_index.get_node_net_handle(id(from_pos)));
from_pos = rev_from ? reverse(from_pos, from_length)
: from_pos;
pos_t to_pos = seeds.at(to_seed.seed).pos;
size_t to_length = distance_index.minimum_length(distance_index.get_node_net_handle(id(to_pos)));
to_pos = rev_to ? reverse(to_pos, to_length)
: to_pos;

// Look up true minimum distance
size_t true_distance = minimum_nontrivial_distance(distance_index, from_pos, to_pos);
if (true_distance <= max_graph_lookback_bases) {
// We should've found this transition
auto from_anchor = from_anchor_itr->second;
auto to_anchor = to_anchor_itr->second;
if (!found.count(from_anchor)
|| !found[from_anchor].count(to_anchor)
|| found[from_anchor][to_anchor] != true_distance) {
has_missing = true;
cerr << "Missing transition " << from_pos << "->"
<< to_pos << " dist " << true_distance << endl;
}
}
};

vector<ZipCodeTree::oriented_seed_t> tree_seeds = zip_code_tree.get_all_seeds();
for (size_t i = 0; i < tree_seeds.size(); i++) {
// Check self-loops
check_distance(tree_seeds[i], false, tree_seeds[i], false);
check_distance(tree_seeds[i], false, tree_seeds[i], true);
check_distance(tree_seeds[i], true, tree_seeds[i], false);
for (size_t j = i + 1; j < tree_seeds.size(); j++) {
// Check all orientation pairs
check_distance(tree_seeds[i], false, tree_seeds[j], false);
check_distance(tree_seeds[i], false, tree_seeds[j], true);
check_distance(tree_seeds[i], true, tree_seeds[j], false);
check_distance(tree_seeds[i], true, tree_seeds[j], true);
}
}

return has_missing;
}

std::vector<transition_info> calculate_transition_read_distances(
const std::vector<transition_info>& all_transitions,
const VectorView<Anchor>& to_chain,
Expand Down Expand Up @@ -547,6 +657,8 @@
cerr << "Chaining group of " << to_chain.size() << " items" << endl;
}

crash_unless(recomb_penalty >= 0);

// Compute a base seed average length.
// TODO: Weight anchors differently?
// TODO: Will this always be the same for all anchors in practice?
Expand All @@ -557,6 +669,20 @@
base_seed_length /= to_chain.size();

chain_scores.resize(to_chain.size());

// We want to prefer to come from seeds where the transition preserves
// access to matching haplotypes, because we don't want to back ourselves
// into a corner where we need a recombination when we don't really have
// to. So we cheat on the dynamic programming by adding an "evaluation
// bonus" to the scores of the different DP options when comparing them. We
// keep this bonus out of the actual recorded scores because we don't want
// it raising the scores we actually get the more transitions we take.
//
// We store the bonus used to select the current winning predecessor for
// each seed in this vector, which runs alongside the DP table.
//
// Starting from nowhere means full path conservation, so bonus = recomb_penalty.
std::vector<int> eval_bonuses(to_chain.size(), recomb_penalty);
for (size_t i = 0; i < to_chain.size(); i++) {
// Set up DP table so we can start anywhere with that item's score, scaled and with bonus applied.
chain_scores[i] = {(int)(to_chain[i].score() * item_scale + item_bonus), TracedScore::nowhere(), to_chain[i].anchor_end_paths()};
Expand Down Expand Up @@ -586,8 +712,20 @@
}

// If we come from nowhere, we get those points.
chain_scores[transition.to_anchor] = std::max(chain_scores[transition.to_anchor],
{(int)item_points, TracedScore::nowhere(), here.anchor_end_paths()});
// This also has full path conservation (bonus = recomb_penalty).
{
TracedScore from_nowhere = {(int)item_points, TracedScore::nowhere(), here.anchor_end_paths()};
int nowhere_bonus = recomb_penalty;
int eval_nowhere = from_nowhere.score + nowhere_bonus;
int eval_current = chain_scores[transition.to_anchor].score + eval_bonuses[transition.to_anchor];
if (eval_nowhere > eval_current) {
chain_scores[transition.to_anchor] = from_nowhere;
eval_bonuses[transition.to_anchor] = nowhere_bonus;
} else if (eval_nowhere == eval_current && from_nowhere > chain_scores[transition.to_anchor]) {
chain_scores[transition.to_anchor] = from_nowhere;
eval_bonuses[transition.to_anchor] = nowhere_bonus;
}
}

// For each source we could come from
auto& source = to_chain[transition.from_anchor];
Expand Down Expand Up @@ -664,8 +802,34 @@
TracedScore from_source_score = source_score.add_points(jump_points + item_points)
.set_shared_paths(here.anchor_paths());

// Remember that we could make this jump
chain_scores[transition.to_anchor] = std::max(chain_scores[transition.to_anchor], from_source_score);
// Evaluate heuristic to preserve path flexibility without inflating actual scoring DP.
// Bonus = fraction of conserved paths * recomb_penalty.
// Bonus is 0 when recombination occurs (no shared paths).
int eval_bonus_from = 0;
if (recomb_penalty > 0) {
int pre_count = __builtin_popcountll(source_score.paths);
if (pre_count > 0 && (source_score.paths & here.anchor_start_paths()) != 0) {
// No recombination: bonus = fraction of paths conserved * penalty
int post_count = __builtin_popcountll(from_source_score.paths);
eval_bonus_from = (recomb_penalty * post_count) / pre_count;
}
// Recombination case (no shared paths): bonus stays 0
}

// Grab the DP table slot we are updating
auto& current_best = chain_scores[transition.to_anchor];
// Compute the evaluation value for the new candidate
int eval_from = from_source_score.score + eval_bonus_from;
// Reconstruct the evaluation value for the current winner
int eval_best = current_best.score + eval_bonuses[transition.to_anchor];

if (eval_from > eval_best || (eval_from == eval_best && from_source_score > current_best)) {
// Using the evaluation values, and then if tied the real DP
// scores, this new candidate beats the previous winner, so
// replace it.
current_best = from_source_score;
eval_bonuses[transition.to_anchor] = eval_bonus_from;
}

if (show_work) {
#ifdef debug_dp
Expand Down
17 changes: 17 additions & 0 deletions src/algorithms/chain_items.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -477,6 +477,23 @@ std::vector<transition_info> generate_zip_tree_transitions(
const std::unordered_map<size_t, size_t>& seed_to_starting,
const std::unordered_map<size_t, size_t>& seed_to_ending);

/**
* Check if all possible transitions were actually found.
*
* Iterates over all pairs of seeds and uses the distance index
* to determine if there SHOULD have been a transition.
*
* Returns if any transitions were missing.
*/
bool find_missing_zip_tree_transitions(
const std::vector<SnarlDistanceIndexClusterer::Seed>& seeds,
const ZipCodeTree& zip_code_tree,
size_t max_graph_lookback_bases,
const std::unordered_map<size_t, size_t>& seed_to_starting,
const std::unordered_map<size_t, size_t>& seed_to_ending,
const SnarlDistanceIndex& distance_index,
const std::vector<transition_info>& all_transitions);

/**
* Calculate read distances for each of the zip tree's transitions.
* Also filters out transitions that can't be used,
Expand Down
Loading