-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serialize decorator data at the end #1531
Serialize decorator data at the end #1531
Conversation
*MAGIC, magic | ||
))); | ||
} | ||
read_and_validate_magic(source)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a single line of code for each item being read, improves readability of this method
2ade28c
to
eb59ff3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! Looks good to me in general.
I think I have two comments:
- How do we actually support dropping the decorator data (see inline comment)? And if this is non-trivial, do we perhaps need to add a public utility function
fn mast_forest_drop_decorators(serialized_mast_forest: &mut Vec<u8>)
to do this? (An honest question - I'm not sure). - Can you add one or more tests to ensure that we can serialize a MastForest, drop its decorator data and deserialize it successfully and that all its decorator data is gone? It seems to me that the serialization tests are somewhat rare, so adding more seems like a good idea.
core/src/mast/serialization/mod.rs
Outdated
// decorator & node counts | ||
target.write_usize(self.decorators.len()); | ||
// node counts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure in which context we intend to actually drop the decorator data from the serialized representation, but I'm wondering: How do we actually do that? We only have a Vec<u8>
so wouldn't we need the length of the decorator data at some fixed offset to read it and then be able to drop the appropriate amount of bytes from the end? Since we can't retroactively write at an offset at the beginning of the section, perhaps we could add this offset at the very end of the layout.
<MAGIC>
<VERSION>
...
<DECORATOR_DATA_OFFSET_FROM_BACK: u64>
This way we could read the u64
(or whatever makes sense, it just needs a fixed size) from the back of the Vec<u8>
. Then we Vec::truncate
the appropriate amount.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PhilippGackstatter Not sure what the best mechanism would be, but your proposal seems reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure in which context we intend to actually drop the decorator data from the serialized representation, but I'm wondering: How do we actually do that?
I think the general idea is that producers, e.g. the compiler, will emit a fully-fleshed out package containing all the things that could be useful for interacting with that package (e.g. debug info, documentation, sources in some caes, etc.), and the MAST contained in the package would have all of the decorators (naturally required to support debugging, etc.). There are contexts where all of that stuff is simply dead weight though, or you want to optimize for binary size, in which case you can strip all of that, resulting in a minimal package that only contains exactly what is needed to execute what it contains. One such potential use case, would be publishing packages on-chain, which would only be feasible if they are as small as possible.
My assumption was that we stored decorators in a separate section, whose offset and size is recorded in the header of the binary format. Decorator references are simply relative offsets to the decorator data stored in the decorator section (i.e. the offset is relative to the base offset of the decorator section). Thus, truncating the decorators conceptually involves splitting the binary at the boundaries of the decorator section, then stitching the parts we're keeping back together, and setting the decorator section size and offset to 0
. Decorator references would remain, but would be ignored since there is no decorator section to resolve them with.
Stripping both the decorator data and the decorator references, probably requires reconstructing the MAST from scratch, since I don't think we have any means by which to keep both references and the data separate from everything else. One slightly less drastic approach might be to visit all the decorator references and zero them out, so that compressing the binary format essentially erases any overhead associated with leaving them behind.
I think fundamentally though, the goal is to provide a fast and efficient means of stripping optional bits from the MAST, without rebuilding it from scratch. Optional sections would be kept separate as described above, so that you don't have to really know anything about what is stored in them to remove them.
That's my take anyway!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the explanation on where this is useful!
Stripping both the decorator data and the decorator references, probably requires reconstructing the MAST from scratch, since I don't think we have any means by which to keep both references and the data separate from everything else.
Maybe I'm missing something but I think that is possible. Currently the before-enter and after-exit decorator data is already completely separate. It's a Vec<(MastNodeId, Vec<DecoratorId>)>
where for each mast node id we add the vector of decorator ids. If that is missing, we simply wouldn't add the decorators to the corresponding mast nodes.
For BasicBlock
s we currently store the DecoratorList
alongside the basic block data, but since it is just referencing ops in the block and decorators, (it's a Vec<(usize, DecoratorId)>
) we could move it to the decorator section and turn it into a Vec<(MastNodeId, Vec<(usize, DecoratorId)>)>
, so that it would work conceptually the same as the other decorators for deserialization: If that list is missing, we simply don't add it to the basic block.
I think that would be the optimal layout as it would allow us to remove all decrator data. Taking your comments into account also, the layout would then be:
//! The serialization format of MastForest is as follows:
//!
//! (Metadata)
//! - MAGIC
//! - VERSION
//!
//! (sections metadata)
//! - nodes length (`usize`)
//! - decorator data section offset (`usize`)
//! - decorators length (`usize`)
//!
//! (procedure roots section)
//! - procedure roots (`Vec<MastNodeId>`)
//!
//! (basic block data section)
//! - basic block data (previously called "node data", but I think it is more precise now)
//!
//! (node info section)
//! - MAST node infos (`Vec<MastNodeInfo>`)
//!
//! (decorator data section)
//! - Decorator data
//! - String table
//!
//! (decorator info section)
//! - decorator infos (`Vec<DecoratorInfo>`)
//!
//! (basic block decorator lists section)
//! - basic block decorator lists (`Vec<(MastNodeId, Vec<(usize, DecoratorId)>)>`)
//!
//! (before enter and after exit decorators section)
//! - before enter decorators (`Vec<(MastNodeId, Vec<DecoratorId>)>`)
//! - after exit decorators (`Vec<(MastNodeId, Vec<DecoratorId>)>`)
@yasonk So for this PR we would have to change:
- Add decorator data section offset to the header. Contrary to what I said before this is actually possible because we serialize the various data into
Vec<u8>
s first and then write those to the target, so we can inspect it for its length and add it to the header before writing those data. - Serialize
DecoratorList
ofBasicBlocks
similarly to before-enter/after-exit decorators.
I think we also should add a MastForest::truncate_decorators(serialized_mast_forest: &mut Vec<u8>) -> Result<(), MastForestError>
function which validates the MAGIC and VERSION and attempts to read the sections metadata. If it is all valid it drops the decorator data from the end and updates the decorator data section offset
and decorators length
.
Finally, we should add tests as mentioned before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PhilippGackstatter Do you think it may be helpful to split this work into 2 PRs?
- Place all decorator data at the end. (Good catch with the decorator list inside basic blocks)
- Truncate decorators.
For (1), I would leave the current tests that make sure the data is serialized and deserialized correctly. Also include a fix for extracting decorator lists that are inside node_data
.
For (2), we would implement the truncate_decorators
along with adding state/logic needed to make that happen. E.g. updating decorator data section offset
and decorators_length
.
This would obviously require new tests to check that the decorators are properly removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it may be helpful to split this work into 2 PRs?
Whatever you prefer, I think both works. The truncate_decorators
function shouldn't be a very large addition. Adding some tests for it and making sure that deserilization still works fine would be more code to add I would guess. Doing 2 PRs is also fine. After (1) the code should also be in a consistent state overall, that's the important part.
Btw re-reading the layout again I proposed, the basic block decorator lists
should be a vector - I edited my comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PhilippGackstatter I decided to first get the decorator data grouped. Will create the truncate_decorators
function and tests in the next PR.
I think I addressed all items, you suggested. Please check if I missed anything.
b57dbdd
to
8449721
Compare
core/src/mast/mod.rs
Outdated
@@ -198,6 +196,10 @@ impl MastForest { | |||
Some(id_remappings) | |||
} | |||
|
|||
pub fn set_decorators(&mut self, node_id: MastNodeId, decorator_list: DecoratorList) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created this to keep the code consistent. But this method should probably be replaced by
mast_forest.get_node_by_id(node_id).set_decorators(decorator_list)
Same for the set_before_enter
and set_after_exit
methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can replace it with an explicit match where we use this method, see here: #1531 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Functionality-wise it seems to work as intended, which is great.
I think we can make use of some existing methods rather than introduce new ones like MastForest::set_decorators
for less public methods and more safety (see inline comments). Other comments are mostly nits and optimizations.
I think you may also need to rebase your commits and sign them (if you haven't done this before, this may help: https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits). At least the merge window at the bottom is complaining about unsigned commits.
core/src/mast/serialization/mod.rs
Outdated
let node_data = basic_block_data_builder.finalize(); | ||
let string_table = string_table_builder.into_table(); | ||
node_data.write_into(target); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
node_data.write_into(target); | |
basic_block_data.write_into(target); |
Nit: We should align the name with the serialization format description at the beginning of the file.
core/src/mast/serialization/mod.rs
Outdated
let roots: Vec<u32> = Deserializable::read_from(source)?; | ||
let decorator_data: Vec<u8> = Deserializable::read_from(source)?; | ||
|
||
// Reading nodes | ||
let node_data: Vec<u8> = Deserializable::read_from(source)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let node_data: Vec<u8> = Deserializable::read_from(source)?; | |
let basic_block_data: Vec<u8> = Deserializable::read_from(source)?; |
Nit: Same here.
core/src/mast/serialization/mod.rs
Outdated
@@ -232,35 +229,105 @@ impl Deserializable for MastForest { | |||
mast_forest | |||
}; | |||
|
|||
let basic_block_decorators: Vec<(usize, DecoratorList)> = read_block_decorators(source)?; | |||
for (node_id, decorator_list) in basic_block_decorators { | |||
let node_id = MastNodeId::try_from((node_id, &mast_forest))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Instead of the TryFrom with a tuple, I would add a from_usize_safe
method analogous to from_u32_safe
. This seems more in line with the existing API and I don't think there's a benefit to implementing the trait.
core/src/mast/serialization/mod.rs
Outdated
let basic_block_decorators: Vec<(usize, DecoratorList)> = read_block_decorators(source)?; | ||
for (node_id, decorator_list) in basic_block_decorators { | ||
let node_id = MastNodeId::try_from((node_id, &mast_forest))?; | ||
mast_forest.set_decorators(node_id, decorator_list); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mast_forest.set_decorators(node_id, decorator_list); | |
match &mut mast_forest[node_id] { | |
MastNode::Block(basic_block) => { | |
basic_block.init_decorators(decorator_list); | |
}, | |
other => { | |
return Err(DeserializationError::InvalidValue(format!( | |
"expected mast node with id {node_id} to be a basic block, found {other:?}" | |
))) | |
}, | |
} |
I would prefer doing something like this rather than implement set_decorators
on MastForest
which really only supports basic blocks. If we have invalid serialized bytes and the node_id
would point to a Join node for example, then set_decorators
would silently ignore that this is an error. So we should explicitly handle it so deserialization fails if we encounter something like that.
core/src/mast/serialization/mod.rs
Outdated
fn read_decorator_infos<R: ByteReader>( | ||
source: &mut R, | ||
decorator_count: usize, | ||
) -> Result<Vec<DecoratorInfo>, DeserializationError> { | ||
let mut decorator_infos: Vec<DecoratorInfo> = Vec::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we wouldn't need to read this to a Vec
first. Where this is called, we could iterate over an iterator that reads the next DecoratorInfo
from source
and returns it. But I think this requires a custom iterator which is probably overkill.
As an alternative optimization, let's use Vec::with_capacity(decorator_count)
here so we only allocate once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to implement the iterator since it isn't that much more code. Also implemented a iterator for reading MastNodeInfo
s
core/src/mast/serialization/mod.rs
Outdated
source: &mut R, | ||
node_count: usize, | ||
) -> Result<Vec<MastNodeInfo>, DeserializationError> { | ||
let mut mast_node_infos: Vec<MastNodeInfo> = Vec::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let mut mast_node_infos: Vec<MastNodeInfo> = Vec::new(); | |
let mut mast_node_infos: Vec<MastNodeInfo> = Vec::with_capacity(node_count); |
core/src/mast/serialization/mod.rs
Outdated
let mut inner_vec: Vec<(usize, DecoratorId)> = Vec::with_capacity(decorator_vec_len); | ||
for _ in 0..decorator_vec_len { | ||
let op_id: usize = source.read()?; | ||
let decorator_id: DecoratorId = source.read()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do this analogous to read_before_after_decorators
and use let decorator_id = DecoratorId::from_u32_safe(source.read()?, mast_forest)?;
so we have the additional safety check that the DecoratorId
we're reading points to a valid decorator and is not out of bounds.
That way we don't need impl Deserializable for DecoratorId
as well.
|
||
/// Used to initialize decorators for the [`BasicBlockNode`]. Replaces the existing decorators | ||
/// with the given ['DecoratorList']. | ||
pub fn init_decorators(&mut self, decorator_list: DecoratorList) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub fn init_decorators(&mut self, decorator_list: DecoratorList) { | |
pub fn set_decorators(&mut self, decorator_list: DecoratorList) { |
Nit: set
instead of init
seems more idiomatic.
core/src/mast/node/mod.rs
Outdated
/// Sets the list of decorators for the [`MastNode`]. Currently, is only supported for the | ||
/// [`Block`] variant. | ||
pub fn set_decorators(&mut self, decorator_list: DecoratorList) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned, I would remove this in favor of a match
at the call-site so we can avoid adding this public method and we can handle errors in a better way.
core/src/mast/serialization/mod.rs
Outdated
//! (basic block data section) | ||
//! - basic block data | ||
//! | ||
//! (node info structs) | ||
//! - MAST node infos (`Vec<MastNodeInfo>`) | ||
//! | ||
//! (raw decorator data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Maybe we can call all the sections "section", e.g. "node info section", etc.. This makes it clearer that the offsets in sections metadata point to those sections.
E.g.:
//! The serialization format of MastForest is as follows:
//!
//! (Metadata)
//! - MAGIC
//! - VERSION
//!
//! (sections metadata)
//! - nodes length (`usize`)
//! - decorator data section offset (`usize`)
//! - decorators length (`usize`)
//!
//! (procedure roots section)
//! - procedure roots (`Vec<MastNodeId>`)
//!
//! (basic block data section)
//! - basic block data
//!
//! (node info section)
//! - MAST node infos (`Vec<MastNodeInfo>`)
//!
//! (decorator data section)
//! - Decorator data
//! - String table
//!
//! (decorator info section)
//! - decorator infos (`Vec<DecoratorInfo>`)
//!
//! (basic block decorator lists section)
//! - basic block decorator lists (`Vec<(MastNodeId, Vec<(usize, DecoratorId)>)>`)
//!
//! (before enter and after exit decorators section)
//! - before enter decorators (`Vec<(MastNodeId, Vec<DecoratorId>)>`)
//! - after exit decorators (`Vec<(MastNodeId, Vec<DecoratorId>)>`)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PhilippGackstatter makes sense. I will remove decorator data section offset (
usize)
per request from @plafer since it will be implemented later.
Is there an GitHub issue for truncating decorators already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh with #1531 (comment) I initially assumed the decorator data section was implemented already but the comment was not updated - my bad.
I don't believe there's an issue for it, so yes let's just create an issue and maybe update the comment to:
//! - decorator data section offset (`usize`) (not implemented, see issue #X)
core/src/mast/serialization/mod.rs
Outdated
//! - decorators length (`usize`) | ||
//! - decorator data section offset (`usize`) (NOT YET IMPLEMENTED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: let's remember to update this before we merge
Rebased and started signing commits. But I don't think I can sign older commits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Thanks for addressing the comments!
A few nits about avoiding allocations and removing public functionality (Deserializable for DecoratorId
) from the public interface that we don't need for now.
You can retroactively sign your commits and then force push. Here's one way: https://stackoverflow.com/questions/41882919/is-there-a-way-to-gpg-sign-all-previous-commits. Without signed commits GitHub won't let us merge this PR.
7848ee8
to
2099676
Compare
@PhilippGackstatter I addressed the last 4 comments. Thank you for the great suggestions in all cases! |
@yasonk Thanks for the updates and signing commits! I think just two more things to make CI happy:
You can try locally if the no-std build works without issues using @plafer Should we go ahead and merge after that or do you want to take another look? |
Let me do a quick review. I'll try to do this on Friday. |
c7e0f29
to
d5a3eb8
Compare
bd21506
to
d74ea4a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thank you for the great work!
Describe your changes
Moving all decorator data to the end in order to enable truncating it later. See issue #1489
Contains a minor refactor to make writing decorator info more direct by reducing interconnected layers of indirection.
Also extracted some functions from the Serializer code to make it easer follow.
Checklist before requesting a review
next
according to naming convention.