Current Behavior
The following H3N2 HA tree has a corresponding clades TSV file that defines a clade A.1.4 with an immediate child clade A.1.4.5 with a HA1:45N substitution (the yellow clade closer to the top of the figure). It also defines a more derived clade A.1.4.2.3 which happens to have a HA1:45N substitution along with several other substitutions and an intermediate parent clade (A.1.4.2).
When augur clades runs on this tree with those definitions, it first assigns the A.1.4.5 label to the larger yellow clade in the screenshot (lower on the figure) because this node has the required mutations and the most descendant leaves. Then, augur clades overwrites that label with the more derived A.1.4.2.3 later on in its loop through all clade definitions. As a result, when augur clades checks for clades that are missing from the tree at the end of the assignment logic, it reports A.1.4.5 as missing. This overwriting of other clades happens multiple times in this dataset as shown by the following output from a local version of augur clades that I modified to report when a node is first assigned a clade and then when that node’s assignment is overwritten:
assigning NODE_0000002 to clade A.1
assigning NODE_0000025 to clade A.1.2
assigning NODE_0000015 to clade A.1.3
assigning NODE_0000743 to clade A.1.4
overwriting NODE_0000015's clade of A.1.3 with A.1.5
assigning NODE_0000026 to clade A.1.2.1
assigning NODE_0000027 to clade A.1.2.2
assigning NODE_0000522 to clade A.1.4.1
assigning NODE_0000439 to clade A.1.4.2
assigning NODE_0000175 to clade A.1.4.3
assigning NODE_0000091 to clade A.1.4.4
overwriting NODE_0000522's clade of A.1.4.1 with A.1.4.5
assigning NODE_0000049 to clade A.1.2.2.1
assigning NODE_0000039 to clade A.1.2.2.2
assigning NODE_0000450 to clade A.1.4.2.1
assigning NODE_0000467 to clade A.1.4.2.2
overwriting NODE_0000522's clade of A.1.4.5 with A.1.4.2.3
assigning NODE_0000736 to clade A.1.4.2.4
assigning NODE_0000196 to clade A.1.4.3.1
assigning NODE_0000362 to clade A.1.4.3.2
assigning NODE_0000283 to clade A.1.4.3.3
overwriting NODE_0000362's clade of A.1.4.3.2 with A.1.4.3.4
overwriting NODE_0000362's clade of A.1.4.3.4 with A.1.4.3.5
assigning NODE_0000503 to clade A.1.4.2.2.1
assigning NODE_0000562 to clade A.1.4.2.3.1
WARNING in augur.clades: clade 'A' not found in tree!
WARNING in augur.clades: clade 'A.1.1' not found in tree!
WARNING in augur.clades: clade 'A.1.3' not found in tree!
WARNING in augur.clades: clade 'A.1.4.5' not found in tree!
WARNING in augur.clades: clade 'A.1.4.3.2' not found in tree!
WARNING in augur.clades: clade 'A.1.4.1' not found in tree!
WARNING in augur.clades: clade 'A.1.4.3.4' not found in tree!
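Five of the seven clades reported as “not found in tree” (A.1.3, A.1.4.1, A.1.4.5, A.1.4.3.2, and A.1.4.3.4) were actually found at one point and then discarded when a later clade in the clades TSV overwrote their assignment. Only A and A.1.1 were never assigned to any node.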
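To make the failure mode concrete, here is a minimal toy sketch of the overwrite pattern described above. It is not augur's actual code: the function name, the `toy_candidates` structure, the node names `upper_node` and `lower_node`, and the leaf counts are all made up for illustration. It only assumes what the issue describes, namely that each clade picks the matching node with the most descendant leaves and that a later clade can replace an earlier label on the same node.

```python
def assign_with_overwrite(candidates):
    """Toy model of the reported behavior (hypothetical, not augur's code).

    candidates maps clade name -> {node name: number of descendant leaves}
    for every node that carries the clade's defining mutations.
    """
    node_to_clade = {}
    for clade, nodes in candidates.items():  # definition order, like the TSV
        if not nodes:
            continue
        # Resolve ties between matching nodes by taking the largest one.
        best = max(nodes, key=nodes.get)
        # No check for an existing label: an earlier clade is silently replaced.
        node_to_clade[best] = clade
    found = set(node_to_clade.values())
    missing = [clade for clade in candidates if clade not in found]
    return node_to_clade, missing


# Made-up example: both nodes carry HA1:45N, only lower_node carries the
# additional A.1.4.2.3 substitutions.
toy_candidates = {
    "A.1.4.5": {"upper_node": 40, "lower_node": 200},
    "A.1.4.2.3": {"lower_node": 200},
}

node_to_clade, missing = assign_with_overwrite(toy_candidates)
print(node_to_clade)  # {'lower_node': 'A.1.4.2.3'}
print(missing)        # ['A.1.4.5']
```

In this toy input A.1.4.5 ends up "missing" even though `upper_node` was a valid place for it, which mirrors the warning in the log.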
Expected behavior
In a way, augur clades is behaving the way we originally expected it to, by resolving conflicts in favor of the largest clade. However, this edge case represents a new situation that we didn’t consider in the original implementation. What I would expect to happen instead is for augur clades to assign the node that best represents A.1.4.2.3 first and then assign A.1.4.5 to the next best remaining node. More generally, clades should be assigned in order from most to least precisely defined, and once a node has been assigned a clade, that assignment should not be overwritten. James noted that this would be more complicated to implement and would require tests to verify that we get the expected behavior.
Possible solution
A simple heuristic approach would be to loop through the clade definitions in order from most to fewest clade-defining mutations, instead of the current approach of looping through clades in the order they appear in the dictionary. We’d need an additional check to prevent a node that has already been assigned to a clade from being considered again. However, I tried a rough prototype of this and it doesn’t actually work: it produces the following tree, where A.1.4.5 descends from A.1.4.2.3 and A.1.4.1.
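For reference, the heuristic itself can be sketched on toy data as follows. This is an illustration under stated assumptions, not the prototype mentioned above and not augur's code: `assign_most_specific_first`, the `toy_definitions` and `toy_matches` structures, the node names, the leaf counts, and the placeholder mutation names `toy_mut_1` and `toy_mut_2` are all hypothetical.

```python
def assign_most_specific_first(definitions, matches):
    """Toy version of the heuristic (hypothetical, not augur's code).

    definitions maps clade name -> list of defining mutations.
    matches maps clade name -> {node name: number of descendant leaves}.
    """
    node_to_clade = {}
    # Visit clades with the most defining mutations first.
    ordered = sorted(definitions, key=lambda c: len(definitions[c]), reverse=True)
    for clade in ordered:
        # Skip nodes that already carry a label, so nothing is overwritten.
        free = {
            node: leaves
            for node, leaves in matches.get(clade, {}).items()
            if node not in node_to_clade
        }
        if not free:
            continue
        node_to_clade[max(free, key=free.get)] = clade
    return node_to_clade


# Made-up inputs: the mutation names other than HA1:45N are placeholders.
toy_definitions = {
    "A.1.4.5": ["HA1:45N"],
    "A.1.4.2.3": ["HA1:45N", "toy_mut_1", "toy_mut_2"],
}
toy_matches = {
    "A.1.4.5": {"upper_node": 40, "lower_node": 200},
    "A.1.4.2.3": {"lower_node": 200},
}

heuristic_result = assign_most_specific_first(toy_definitions, toy_matches)
print(heuristic_result)  # {'lower_node': 'A.1.4.2.3', 'upper_node': 'A.1.4.5'}
```

On this two-clade toy input both labels survive, but that says nothing about real trees. The sketch ignores the parent/child relationships between clades entirely, which is consistent with the prototype producing an inconsistent hierarchy on the actual dataset.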
A proper approach would probably need to account for the explicit hierarchical structure in the clade definitions and require multiple passes through the tree to refine assignments or a breadth-first search of the tree.
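Whatever approach is chosen, a test could check that the resulting assignment respects the clade hierarchy. The sketch below is one possible check, not part of augur. It assumes (my assumption, not stated in the clades TSV) that the dotted clade names encode the hierarchy, so that A.1.4.2.3 belongs under A.1.4.2, and it represents the tree as a plain child-to-parent dictionary. All function and node names are hypothetical.

```python
def nearest_labeled_ancestor(node, parent_of, node_to_clade):
    """Return the clade of the closest ancestor that has a label, or None."""
    current = parent_of.get(node)
    while current is not None:
        if current in node_to_clade:
            return node_to_clade[current]
        current = parent_of.get(current)
    return None


def hierarchy_violations(node_to_clade, parent_of):
    """List (clade, enclosing clade) pairs that contradict the dotted names.

    A clade is accepted when no labeled ancestor exists or when the enclosing
    clade's name is a dotted prefix of its own name. Missing intermediate
    clades are tolerated.
    """
    violations = []
    for node, clade in node_to_clade.items():
        enclosing = nearest_labeled_ancestor(node, parent_of, node_to_clade)
        if enclosing is not None and not clade.startswith(enclosing + "."):
            violations.append((clade, enclosing))
    return violations


# Made-up tree in which A.1.4.5 was placed underneath A.1.4.2.3.
toy_parent_of = {
    "root": None,
    "n_a14": "root",
    "n_a142": "n_a14",
    "n_a1423": "n_a142",
    "n_a145": "n_a1423",
}
toy_assignment = {
    "n_a14": "A.1.4",
    "n_a142": "A.1.4.2",
    "n_a1423": "A.1.4.2.3",
    "n_a145": "A.1.4.5",
}

print(hierarchy_violations(toy_assignment, toy_parent_of))
# [('A.1.4.5', 'A.1.4.2.3')]
```

A check like this only detects inconsistent placements. It does not decide which node each clade should get, which is the part that likely needs multiple passes or a breadth-first traversal.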