-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add start and end positions in output matrix #135
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, left a few specific comments.
Additional considerations not addressed:
During deduplication of identical peptide sequences post-modification, add protein-positions
sage/crates/sage/src/database.rs
Line 197 in a969ca7
keep.proteins.extend(remove.proteins.iter().cloned()); |
Ensure that positions and proteins identified are co-sorted:
sage/crates/sage/src/database.rs
Line 206 in a969ca7
.for_each(|peptide| peptide.proteins.sort_unstable()); |
Might be best to define a new struct for assigning proteins that includes both the identifier and positions, to preclude any potential bugs from the above.
struct ProteinAssignment {
identifier: String,
// do these need to be usize (8 bytes)? u32 (4 bytes) is probably sufficient...
start: usize,
end: usize,
}
crates/sage/src/peptide.rs
Outdated
@@ -28,6 +28,10 @@ pub struct Peptide { | |||
pub position: Position, | |||
|
|||
pub proteins: Vec<Arc<String>>, | |||
/// What residue does this peptide start at in the protein (1-based inclusive)? | |||
pub start_position: Vec<Arc<usize>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for Arc
- this is a smart pointer (atomic reference counted) allocated on the heap. It's used for proteins: String
(which is already heap allocated) to prevent repeated clones of protein identifiers. Doesn't make sense to use an Arc
for a simple usize in this case!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that makes sense!
crates/sage/src/enzyme.rs
Outdated
@@ -76,6 +82,7 @@ impl std::hash::Hash for Digest { | |||
fn hash<H: std::hash::Hasher>(&self, state: &mut H) { | |||
self.sequence.hash(state); | |||
self.position.hash(state); | |||
self.start_position.hash(state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is correct behavior (won't necessarily cause a bug though). We use hash
to deduplicate digests prior to creating peptides.
Only sequence
and position
are considered to ensure that we don't accidentally deduplicate peptide sequences that occur on protein termini, which might be assigned termini-specific modifications. Including start_position
will lead to extra (duplicated) digests being created -> more duplicated peptides that must be generated (expensive) and then trashed during de-duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that makes sense. I was thinking about maybe including a list of start indices for each protein in the proteins
list, but that probably doesn't make sense from a peptide-centric search perspective, and other search engines just list the index for the leading protein.
"rank", | ||
"label", | ||
"expmass", | ||
"calcmass", | ||
"measured_mass", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the rationale for changing column names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, the rationale here, although I'm not wedded to them, are to 1) keep the underscore delimiting in the rest of the columns, 2) spell out calculated and experimental/measured, and 3) including number after scan doesn't feel necessary since every entry has "scan=[scan number]".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not opposed to it - but the column names have been stable for ~2.5 years so I don't see a huge need to rename them (and thus force people to retool pipelines downstream of Sage). They were originally named like this because sage results (used) to be passed directly to mokapot for rescoring.
crates/sage/src/peptide.rs
Outdated
@@ -373,6 +380,8 @@ impl TryFrom<Digest> for Peptide { | |||
missed_cleavages: value.missed_cleavages, | |||
semi_enzymatic: value.semi_enzymatic, | |||
proteins: vec![value.protein], | |||
start_position: value.start_position, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Must be a Vec
of positions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thanks for the feedback on this! I'll continue working on the PR and mark it ready for review when ready or ask some follow-up questions along the way. |
I'm drafting a PR for #110