Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multistatereporter variable pos frequency #767

Merged

Conversation

ianmkenney
Copy link
Contributor

@ianmkenney ianmkenney commented Jan 16, 2025

Continuation of #712. There are multiple avenues to address the current failing tests. I would like to address those concerns here before continuing in that PR.

@ianmkenney ianmkenney force-pushed the multistatereporter_variable_pos_frequency branch 2 times, most recently from a0e7dbb to a262c2b Compare January 17, 2025 18:25
@ianmkenney ianmkenney changed the title Multistatereporter variable pos frequency and masked arrays [DNM/WIP] Multistatereporter variable pos frequency and masked arrays Jan 17, 2025
@ianmkenney ianmkenney changed the base branch from multistatereporter_variable_pos_frequency to main January 17, 2025 18:33
@ianmkenney
Copy link
Contributor Author

This PR will take over #712.

Accessing data from netCDF returns a masked array, even if no data is
masked. For valid data, extract it and give back just the basic numpy
array. For invalid data, give back a nan array with the proper shape.
@ianmkenney ianmkenney changed the title [DNM/WIP] Multistatereporter variable pos frequency and masked arrays [DNM/WIP] Multistatereporter variable pos frequency Jan 17, 2025
Copy link

codecov bot commented Jan 17, 2025

Codecov Report

Attention: Patch coverage is 97.52066% with 3 lines in your changes missing coverage. Please review.

Project coverage is 85.13%. Comparing base (f3f355a) to head (122836f).
Report is 1 commits behind head on main.

Additional details and impacted files

@ianmkenney ianmkenney changed the title [DNM/WIP] Multistatereporter variable pos frequency Multistatereporter variable pos frequency Jan 17, 2025
@ianmkenney
Copy link
Contributor Author

@ijpulidos @IAlibay @mikemhenry I can't request reviewers, but I think this is ready for review.

Copy link
Contributor

@ijpulidos ijpulidos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Thanks a lot! Great work. Just a couple of non-blocking comments.

raise IndexError
# pull the valid data out of masked array
positions = unit.Quantity(x.data, unit.nanometers)
except (IndexError, KeyError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking comment. Should we try outputting something here when an error is caught? Maybe something warning users that there's missing data or something similar. What are the reasons for the array to be masked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the topic of how MaskedArrays work here. Assuming that a particular coordinate exists in a dimension (e.g. at least some data exists for a given iteration) we always get back a MaskedArray. The MaskedArray will have a data attribute with a numpy array that will be masked by the MaskedArrays mask attribute. If the data was all valid, then we just pull that information out and expose it to the user. If no data exists at that dimension coordinate, then we'll just get an IndexError so we give back a zero array in place of the nonsense data inside the MaskedArray. By doing it this way, a user never has to deal with a MaskedArray.

I think that if the data was masked it's basically the same behavior as if the data didn't exist and we would have expected a zero array anyways. I don't really see much of a reason to give a warning about this since having a giant array of zeros is kind of a warning already. I'd even be careful saying the data is missing since that would imply it should have been there in the first place.

information, 0 would prevent information being written
velocity_interval : int, default 1
the frequency at which to write positions relative to analysis
information, 0 would prevent information being written
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking question. This might be too late now to ask, but why is that we want positions and velocities to be stored at different intervals? Does it make sense to define a state (i.e. with positions and velocities) from different intervals? Just thinking about ways of not having to deal with the "complexity" of having to check for pos and vel intervals independently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would depend on the intent of the simulation. I could see that someone interested in doing VAC calculations wouldn't care as much about the positions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the one case I can think of is that you might not want to store the velocities but you do want to store the positions.

I might be wrong, but as far as I remember the "state" is stored in the checkpoint? So you could recover a simulation from the checkpoint, and just store positions in the simulation file.

@IAlibay
Copy link
Contributor

IAlibay commented Jan 17, 2025

I'll be testing this in a Protocol before I review :)

@@ -100,6 +100,12 @@ class MultiStateReporter(object):
analysis_particle_indices : tuple of ints, Optional. Default: () (empty tuple)
If specified, it will serialize positions and velocities for the specified particles, at every iteration, in the
reporter storage (.nc) file. If empty, no positions or velocities will be stored in this file for any atoms.
position_interval : int, default 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we store these values inside of the NetCDF file as additional variables? This would make it a lot easier to know what the spacing is between frames, rather than having to iterate through the positions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm being blind sorry 🙈 - it's below

openmmtools/multistate/multistatereporter.py Outdated Show resolved Hide resolved
# pass zeros as velocities when key is not found (<0.21.3 behavior)
velocities = np.zeros_like(positions)

if 'box_vectors' in storage.variables:
# Restore box vectors.
x = storage.variables['box_vectors'][read_iteration, replica_index, :, :].astype(np.float64)
# TODO: Are box vectors also variably saved?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From wwhat I can tell, they seem to be saved at the same rate as positions - worth double checking though!

Copy link
Contributor Author

@ianmkenney ianmkenney Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is saved at the same rate of positions assuming that the sampler_state's box_vectors is not None (line 1669). I find this line a little odd though since there is ais_periodic property of the reporter (line 204) that follows slightly different logic. I'm not sure which is best to use. @ijpulidos do you have any thoughts on this (I've pinged you directly at the offending line)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it makes sense that these are stored at the same rate of positions, but again, there might be a use case where we want different rates for it? The logic here being that to maintain the thermodynamic states (especially pressure) you need to vary the box size and that's dependent on the positions. I'd argue that having positions without box sizes, or vice versa, is probably something we don't want to deal with.

Try and retrieve values from the netcdf file. If the data is masked,
assume it hasn't been written at that iteration and raise an
`IndexError`. If the box_vectors variable doesn't exist, then a
`KeyError` is raised. Both execptions will bind None to the
`box_vectors` variable.
@@ -1638,7 +1666,7 @@ def _write_sampler_states_to_given_file(self, sampler_states: list, iteration: i

storage = self._storage_dict[storage_file]
# Check if the schema must be initialized, do this regardless of the checkpoint_interval for consistency
is_periodic = True if (sampler_states[0].box_vectors is not None) else False
is_periodic = sampler_states[0].box_vectors is not None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ijpulidos is there a reason we aren't using the MultiStateReporter.is_periodic property here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot say for sure. I guess this means that one can have a box_vectors variable in the reporter but not necessarily have box_vectors attribute in the sampler states, and vice versa. I don't know if we ever want those NOT to match. Is there a use case for such scenario?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I'd be inclined to keep it the same in this PR, it just struck me as strange. Both lines were introduced in the same commit so I'm going to go on the assumption it was deliberate rather than an oversight.

@ianmkenney
Copy link
Contributor Author

@IAlibay @ijpulidos unless there is more to do here, I think this can be merged!

Copy link
Contributor

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't approve on this repo, but lgtm!

@ijpulidos ijpulidos merged commit c83cb98 into choderalab:main Jan 27, 2025
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants