How to query for nodes in the hash-based database? #79
I think querying with conditions and tolerances might be useful, for example bounds on certain values: something like `node.search("version" within 1.0-2.0, flagargument within 400-500, flagargument2 is "Fe")`. But honestly, I would be happy if it just cached existing results and sped up future calculations. Most of the time I don't see people querying results so much as building workflows to get specific properties; in that case they are "querying" the database of results implicitly anyway.
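A minimal sketch of what such a condition-based search could look like in Python; `search`, `Between`, and `Equals` are hypothetical names invented for illustration, not an existing pyiron API:

```python
from dataclasses import dataclass

@dataclass
class Between:
    """Inclusive numeric range condition."""
    lo: float
    hi: float

    def matches(self, value) -> bool:
        return self.lo <= value <= self.hi

@dataclass
class Equals:
    """Exact-value condition."""
    expected: object

    def matches(self, value) -> bool:
        return value == self.expected

def search(records, **conditions):
    """Return the records (dicts) that satisfy every named condition."""
    return [
        r for r in records
        if all(f in r and c.matches(r[f]) for f, c in conditions.items())
    ]

# Mirrors the pseudocode in the comment above:
db = [
    {"version": 1.5, "flagargument": 450, "flagargument2": "Fe"},
    {"version": 2.5, "flagargument": 450, "flagargument2": "Fe"},
]
hits = search(db, version=Between(1.0, 2.0),
              flagargument=Between(400, 500), flagargument2=Equals("Fe"))
```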
Querying the version range is not an issue. How do you handle […]
I think we keep it simple and have it only query the result of the specific node that it is searching for. I don't see the value in being able to traverse workflow node dependencies with a query and downstream results, simply because I don't think it is a feature that will be used in practice. I could be wrong, but that's my opinion. I can count on one hand the times I have looked at raw calculation results that I recall having done before. On the other hand, attempting a calculation and having it return a previously calculated result is a fundamentally useful user-facing feature that requires no extra user effort. Being able to query individual calculations or even workflows, not so much.

E.g. I can imagine that a node calculating the bulk structure of iron containing 2 atoms has an insane number of children/downstream nodes depending on its results: surfaces, grain boundaries, dislocations, vacancies, and so on. How would you browse this tree as a human to find what you are looking for? Unless people are utilising prebuilt macros that tag these results with specific human-readable labels, they are incomprehensible. How would you navigate this tree with queries to find what is useful? This seems like a hard problem, and one that doesn't feel like it needs a solution since it is of very little value. If people complain and want this, then we can think again.
So to answer your question, I would have thought that the cache would store raw output/input values, since that is what is required to identify whether something is identical when hashing.
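For illustration, a minimal sketch of that caching idea, assuming inputs are JSON-serializable and using a plain dict as the cache; the actual hashing scheme and backend are exactly what is under discussion in this thread:

```python
import hashlib
import json

def input_hash(inputs: dict) -> str:
    """Digest of the raw input values; stable across runs via sorted keys."""
    payload = json.dumps(inputs, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_with_cache(run, inputs: dict, cache: dict):
    """Skip the calculation when an identical input set was seen before."""
    key = input_hash(inputs)
    if key not in cache:
        cache[key] = run(**inputs)  # cache miss: actually compute
    return cache[key]               # cache hit: reuse the stored raw output
```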
Yes, it does. Ok, so searching is not so important for the moment and we will skip it. We focus on the caching of simulation results. |
It does? I thought it stored raw input only if it was in a sufficiently primitive form, and otherwise stored a link to an upstream node (repeating until you finally get all sufficiently primitive input). Similarly, I thought it stored raw output only if it were in a sufficiently primitive form; otherwise it stored a link to serialized data and the knowledge of how to deserialize it. (That is why I'm working on a hierarchical pickle-compliant serializer: so we can exploit the hierarchy to deserialize only the interesting output variable and not the whole node, or even the graph that led to it, which is anyhow stored more abstractly in the hash database: sparser and more efficient, but sufficient for verifying the output is the one we're looking for.) Or is this just a difference of semantics, where you're considering the hashed inputs and file-reference outputs as "raw" since they have some sort of string representation?

Edit: wait, the original statement has both "cache" and "hash", so maybe this is simply me getting completely confused. I agree the cache, as it's currently implemented, does hold the raw values.
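A sketch of the partial-loading idea, under the assumption that node outputs live in a hierarchical HDF5 layout; the file layout and paths here are made up for illustration:

```python
import h5py

def load_output(filename: str, node_path: str, output_name: str):
    """Read one output dataset without deserializing the node's other
    outputs, let alone the graph that produced them."""
    with h5py.File(filename, "r") as f:
        return f[f"{node_path}/outputs/{output_name}"][()]
```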
The node-result database should store the inputs and outputs, no? The hash is only for lookup. To me, that is the implementation that makes the most sense. If an input is not compliant with supported storage types (even after traversing to the top node), we simply give up. Isn't this how we are doing this? In order to access the result storage, inputs must be objects that are hashable in some form. I think limiting the functionality to primitives is a really bad idea; I would prefer "if your object input can be hashed, you have access to the database". So in effect you can recursively attempt to hash complex non-primitive inputs and then hash the entire input set in turn. Or is this a bad idea? That's what I understand from the specs…
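As a sketch of "recursively attempt to hash complex non-primitive inputs", one could reduce each value to a stable token and give up with an error when a component has no supported form; the names and the `node_hash` hook are illustrative assumptions, not existing code:

```python
import hashlib

PRIMITIVES = (bool, int, float, str, bytes, type(None))

def try_hash(value) -> str:
    """Recursively reduce an input to a digest, or raise TypeError if some
    component is not hashable in a supported form ("we simply give up")."""
    if isinstance(value, PRIMITIVES):
        token = f"{type(value).__name__}:{value!r}"
    elif isinstance(value, (list, tuple)):
        token = "seq:" + ",".join(try_hash(v) for v in value)
    elif isinstance(value, dict):
        token = "map:" + ",".join(
            f"{try_hash(k)}={try_hash(v)}" for k, v in sorted(value.items())
        )
    elif hasattr(value, "node_hash"):  # e.g. an upstream node supplying its own hash
        token = "node:" + value.node_hash()
    else:
        raise TypeError(f"unsupported input type: {type(value).__name__}")
    return hashlib.sha256(token.encode()).hexdigest()

def input_set_hash(inputs: dict) -> str:
    """Hash the entire input set in turn."""
    return try_hash(inputs)
```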
How to hash?
The Python […]

What is stored?
I store the output of the node to disk using the hdf5 backend of the […].

So, yes @ligerzero-ai, I am glad this is also what you got out of the spec.

@liamhuber I am still in the dark about what your storage solution will look like. Is it usable here? Do you implement the interface from the storage SPEC? Can it be merged with my poc? Will it work for different backends, or is it bound to hdf5? From my side I am curious how you store arbitrary datatypes and how you achieve partial loading. Neither is currently possible with my poc.

@pyiron/storage Do you have different ideas on how to calculate the hash?
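For concreteness, a toy version of "store the output of the node to disk" with h5py, keyed by the node hash so the lookup table stays a pure hash-to-location map; the actual poc's layout may well differ:

```python
import h5py

def store_outputs(filename: str, node_hash: str, outputs: dict) -> None:
    """Write each output under a group named by the node's hash, so a later
    lookup by hash finds the stored results directly."""
    with h5py.File(filename, "a") as f:
        grp = f.require_group(node_hash)
        for name, value in outputs.items():
            if name in grp:      # overwrite a stale entry
                del grp[name]
            grp[name] = value    # numbers, strings, and arrays map natively
```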
What should the interface look like?
You can query the database based on a node hash or a node type, for example. However, is this useful / enough? How much value does a list of cached lammps nodes have? You then know the inputs are a bunch of other nodes. I imagine that for it to be really useful, the queries need to be more extensive. So how should it look? What is a typical question you want the database to answer?
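For reference, the two queries mentioned above are trivial to express; the record fields (`hash`, `node_type`) are assumptions about the database schema, not a fixed design:

```python
def by_hash(db, node_hash: str):
    """Look up a single node by its input hash."""
    return [r for r in db if r["hash"] == node_hash]

def by_type(db, node_type: str):
    # e.g. by_type(db, "Lammps"): all cached lammps nodes, whose inputs
    # may themselves just be references to other nodes
    return [r for r in db if r["node_type"] == node_type]
```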
@pyiron/storage