DataFile rewrite design #1254
Uses the existing DataFormat, but has a different interface to DataFile. The main difference is that no internal pointers are stored to variables; instead, values are set explicitly:

```cpp
File dump("test.nc", File::ReadWrite);
int value = dump["val"]; // Read
dump["val"] = 3;         // Change value (write)
```

Nowhere near complete: only integers at the moment, and no time dependence. Part of the discussion in issue #1254.
Quickly put together a minimal working version of an interface in branch "new-fileio". Header file code: https://github.com/boutproject/BOUT-dev/blob/new-fileio/include/bout/fileio.hxx Example code: https://github.com/boutproject/BOUT-dev/blob/new-fileio/examples/fileio/test-fileio.cxx I quite like being able to read data using …
which is explicit but a bit ugly; or maybe have another class (…). Perhaps that might be clearer, since …
Should assigning one …? I think a reasonable interpretation is that "val" in the file is modified to have the same value and attributes as "another". Someone could also read it as just changing …
I've started looking at this in earnest now, and there are some design decisions we need to make. This post is quite long, but there's a fair bit to consider, and if we're going to break things, it's probably best to make sure we do think about everything! To summarise some downsides of the current approach:
Given that fixing basically any one of these requires significant changes, let's consider a complete rewrite, including breaking changes to the API and physics models. It would be nice if we could mechanically upgrade physics models to use whatever replacement we decide on. We have one option already: …
Another option might be an interface that immediately reads/writes values, rather than a whole `Options` object at once:

```cpp
dump.write("field", Field3D{1.0});
Field3D field = dump.read("field");
```

We'd probably want to use a template:

```cpp
template <typename T> // but constrained to Options::ValueType members?
void OptionsNetCDF::write(const std::string& name, T value) {
  Options options;
  options[name] = std::move(value); // still involves a copy, just in the function signature
  write(options);
}

template <typename T>
T OptionsNetCDF::read(const std::string& name) {
  return read()[name]; // pretty sure this still has a copy
}
```

In either case, we have a slight stumbling block when it comes to writing auxiliary time-evolving variables (that is, variables not added to the solver). For time-evolving variables, something like:

```cpp
void MyModel::outputVars() override {
  // OptionsNetCDF API
  Options options; // could be passed in or a member of PhysicsModel?
  options["field"] = Field3D{1.0};
  options["field"].attributes["time_dimension"] = "t";
  dump.write(options); // or alternatively return options and have this call within the library?

  // Alternative, immediate-write API
  dump.write("field", Field3D{1.0}, DumpFile::time_evolving); // Needs an argument for _which_ time dimension?
}
```

I think it should be possible to scan a user's physics model source code and move … A variation on the above is to pass an … Other things to consider:
Phew! I'm sure I've still missed something important. I'm very interested in people's thoughts on any of this. I think I'm going to press on trying to replace …
Scattered thoughts:
We might need to be a bit careful about exactly when some quantities are written. For example, in STORM we normalise the metric components in the …
Thanks John, some really useful points to think about there. Some more scattered thoughts in response!
So a pretty severe limitation of netCDF is not being able to get the current extent of individual variables, only the maximum extent of an unlimited dimension (i.e. time). I've opened an issue about it, maybe they will add something -- but even if they do, we would then rely on a bleeding-edge version of netCDF. I've bumped into this problem again due to keeping an … This works:

```cpp
OptionsNetCDF(filename).write(options);
```

but the following won't:

```cpp
OptionsNetCDF file(filename);
file.write(options);
file.write(options);
```

I can see two potential ways around this:
I've opted for the second option, which results in the time-dependent variable writing bit looking like this:

```cpp
// NetCDF doesn't keep track of the current extent for each
// variable (although the underlying HDF5 file does!), so we
// need to do it ourselves. We'll use an attribute in the
// file to do so, which means we don't need to keep track of
// it in the code
int current_time_index;
const auto atts_map = var.getAtts();
const auto it = atts_map.find("current_time_index");
if (it == atts_map.end()) {
  // Attribute doesn't exist, so let's start at zero. There
  // are various ways this might break, for example, if the
  // variable was added to the file by a different
  // program. But note that if we use the size of the time
  // dimension here, this will increase every time we add a
  // new variable! So zero is probably the only sensible way
  // to do this
  current_time_index = 0;
} else {
  it->second.getValues(&current_time_index);
}

std::vector<size_t> start_index; ///< Starting index where data will be inserted
std::vector<size_t> count_index; ///< Size of each dimension

// Dimensions, including time
for (const auto& dim : dims) {
  start_index.push_back(0);
  count_index.push_back(dim.getSize());
}
// Time dimension
start_index[0] = current_time_index;
count_index[0] = 1; // Writing one record

fmt::print("Writing {} at time {}\n", name, current_time_index);

// Put the data into the variable
bout::utils::visit(NcPutVarCountVisitor(var, start_index, count_index),
                   child.value);

// Make sure to update the time index, including incrementing it!
var.putAtt("current_time_index", ncInt, ++current_time_index);
```

This will break if someone adds a time-dependent variable to the file without adding the `current_time_index` attribute. The first option is likely more robust, but as we'll need to close the file first, I think that might be quite complicated in practice.
For context, the current …
After chatting with @johnomotani, we realised that one problem with my proposed method above is that it's possible for different variables to get out of sync, whereas with … John suggested a …
A separate thought, but one I want to write down somewhere -- we have a time dimension …
I think a netCDF dimension is only an array size: the 'dimension' has to be a list of consecutive integers starting from 0.
That's the dimension IDs, rather than the dimensions themselves. It turns out that if you just define a dimension like we currently do, netCDF basically just makes an empty dataset. If you then also define a variable with the same name, you can then give it values. Here's one of their examples demonstrating this: https://www.unidata.ucar.edu/software/netcdf/docs/pres__temp__4D__wr_8c_source.html

Thinking a little bit more about keeping track of the last time index for each variable -- instead of needing to call …
Ah, OK - it's a convention https://www.unidata.ucar.edu/software/netcdf/workshops/2011/datamodels/NcCVars.html (a very logical one!), and a convention that's used by …
That seems sensible to me, as long as the main library does take care of the checking for the standard output … If a user (or e.g. FastOutput) wants to define a different unlimited dimension and append to it, they can be responsible for adding the check themselves. BTW I just found out (https://www.unidata.ucar.edu/software/netcdf/docs/unlimited_dims.html) that in netCDF-4 files you're allowed any number of unlimited dimensions, but in netCDF 'classic' you can only have one - another reason to only support netCDF-4 in v5.
I've made some decent progress on this, but there are (at least) three points that need a bit of thinking about:

For the first question, my current design adds:

```cpp
/// Output additional variables other than the evolving variables
virtual void outputVars(MAYBE_UNUSED(Options& options)) {}
/// Add additional variables other than the evolving variables to the restart files
virtual void restartVars(MAYBE_UNUSED(Options& restart)) {}
```

to `PhysicsModel`. I'm hoping this will make it reasonably easy to automate an upgrade. Does the separation into two methods make sense?

For the second point, currently … it is definitely easiest for either the model or the solver to own the output file, as they have pointers to each other, and can easily call mesh and coordinates methods. Ideally, I think there'd be a … Adding `using PhysicsModel::PhysicsModel;` to user models is definitely possible to automate, and we already have a physics model upgrader tool. I'm trying to encapsulate things enough that such a refactoring wouldn't be too hard.

For the third point: currently, marking a variable as time-evolving looks like:

```cpp
options["field"] = field; // Need to use .force() for ints and BoutReals
options["field"].attributes["time_dimension"] = "t";
```

which is both a bit unwieldy, and has more surface area for mistakes. I keep coming back to something like:

```cpp
template <class T>
Options::assignRepeat(T value, std::string time_dimension = "t", std::string source = "") {
  force(value, source);
  attributes["time_dimension"] = time_dimension;
}

options["field"].assignRepeat(field);
```

which I'm not overly happy with, but might be the best we can do? There's still the risk of mixing up … I'd love to hear people's thoughts on any of these points.
A complication for automatic upgrading: some variables are only conditionally written. Here's an example from …:

```cpp
if (evolve_ajpar) {
  solver->add(Ajpar, "Ajpar");
  comms.add(Ajpar);
  output.write("ajpar\n");
} else {
  initial_profile("Ajpar", Ajpar);
  if (ZeroElMass) {
    SAVE_REPEAT(Ajpar); // output calculated Ajpar
  }
}
```

That's basically going to be impossible to correctly translate with regexp. There is a clang Python API that might make it possible, but it's still probably going to be tricky. I think @bendudson suggested storing pointers in …
Thanks @ZedThree, I think there might be a way to keep some kind of backward compatibility: …

So if a user wanted to write different things to output, they could implement this … I'm undecided whether passing in an … One thing I'm not sure about is how to add multiple monitors. A different virtual function for high frequency output (…)?
That looks good! I'll have a go at implementing that. One slightly annoying thing is that we basically want the …
We currently don't have a way to "flat-merge" …
I would also like to see a way to decouple the …
Ah yes, I guess a flat-merge would be needed at the moment. Eventually it would be nice to have each …
I think the issue is that xarray doesn't read groups out of the box, but if you know what groups to read, it works? But yes, I would also like more structure in the output file! We were talking about saving the inputs from each restart in some group structure too.
Had a thought about the different monitors: when making quantities in the output time-dependent, the label of the time coordinate is specified. The …
The default is "t", but if a monitor is created which outputs at a different rate, it would use a different time dimension?
Ah good point! We'll need to keep track of … Also, I realised that merging …
There's been an on-and-off debate on …
The replacement of … One thing I've ignored so far is the various options to …:

```cpp
OPTION(opt, parallel, false);     // By default no parallel formats for now
OPTION(opt, flush, true);         // Safer. Disable explicitly if required
OPTION(opt, guards, true);        // Compatible with old behavior
OPTION(opt, floats, false);       // High precision by default
OPTION(opt, openclose, true);     // Open and close every write or read
OPTION(opt, enabled, true);
OPTION(opt, init_missing, false); // Initialise missing variables?
OPTION(opt, shiftoutput, false);  // Do we want to write 3D fields in shifted space?
OPTION(opt, shiftinput, false);   // Do we want to read 3D fields in shifted space?
OPTION(opt, flushfrequency, 1);   // How frequently do we flush the file
```

Of these, I've only kept … For the others:
Which just leaves …
I'm not sure what …
Yeah, … OK, next stumbling block: … Off the top of my head, there are two routes we could go down:

The first option is probably easy to implement, but would mean we necessarily have to read in the entire gridfile on every processor. The second option is probably nicer in terms of memory use, etc., and is probably more amenable to eventually using parallel netCDF down the line again, perhaps? But it poses more questions in terms of API. Maybe just an overload of `read`:

```cpp
Options data = OptionsNetCDF(unknown).read(mesh);
```

(maybe an optional rank too, for testing?)
Oh, the other difficulty with option 2 is that when we read in the data in … There's also the question of older files needing FFTs.
There are various issues related to the current DataFile implementation, e.g. issues #222, #221, #102, #412, #374, #367, #644. The current design is convenient in many cases, but leads to nasty surprises when more complicated things are attempted. As a reminder, in the current system:

- DataFile provides a consistent interface to various DataFormat classes, which each implement interfaces to the NetCDF, HDF libraries etc.
- DataFile stores pointers to these objects, which therefore can't be destroyed (e.g. go out of scope)
- dump.read() loads all data into the fields
- dump.write() saves data from all the fields

This makes sense for data which is all output together at fixed time points (e.g. classic NetCDF), but is too restrictive if we want more than one time index (e.g. high, low frequency). It also makes adding diagnostics cumbersome, since temporary variables must be created and added to store the diagnostic values.
I think a new design should have clearer read() and write() semantics, so reading a value becomes something like:
and writing a value is something like:
There are many different ways of doing this, and several ways we could handle appending time indices. Probably good to look at how other C++11 libraries handle things. For example: https://github.com/BlueBrain/HighFive