
How to query for nodes in the hash-based database? #79

Open
XzzX opened this issue Dec 10, 2024 · 8 comments

Comments

@XzzX
Contributor

XzzX commented Dec 10, 2024

What should the interface look like?
You can query the database by node hash or node type, for example. However, is this useful or sufficient? How much value does a list of cached LAMMPS nodes have? You then only know that the inputs are a bunch of other nodes. For the database to be really useful, I imagine the queries need to be more expressive. So what should the interface look like? What is a typical question you want the database to answer?

@pyiron/storage

@ligerzero-ai
Contributor

I think querying with conditions and tolerances might be useful, for example bounds on certain values.

Something like node.search(“version” within 1.0-2.0, flagargument within 400-500, flagargument2 is value “Fe”).

But honestly, I would be happy if it just cached existing results and sped up future calculations. Most of the time I don’t see people querying results; they build workflows to get specific properties. In that case they are “querying” the database of results implicitly anyway.
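A query interface along those lines could be sketched as follows. This is a minimal illustration only: the `search` function, the record fields, and the tuple-as-range convention are all hypothetical, not an existing pyiron API.

```python
# Hypothetical sketch of a range/equality query over cached node records.
# Not pyiron API; field names and the (low, high) range convention are invented.
def search(records, **conditions):
    """Return the records satisfying every condition.

    A condition is either a (low, high) tuple (inclusive range)
    or a plain value that must match exactly.
    """
    matches = []
    for record in records:
        ok = True
        for field, cond in conditions.items():
            value = record.get(field)
            if isinstance(cond, tuple):
                low, high = cond
                if value is None or not (low <= value <= high):
                    ok = False
                    break
            elif value != cond:
                ok = False
                break
        if ok:
            matches.append(record)
    return matches


records = [
    {"version": 1.5, "flagargument": 450, "flagargument2": "Fe"},
    {"version": 2.5, "flagargument": 450, "flagargument2": "Fe"},
]
# Mirrors the example in the comment: version in 1.0-2.0,
# flagargument in 400-500, flagargument2 equal to "Fe".
hits = search(records, version=(1.0, 2.0), flagargument=(400, 500), flagargument2="Fe")
```

Only the first record falls inside the version range, so `hits` contains one entry.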

@XzzX
Contributor Author

XzzX commented Dec 10, 2024

Something like node.search(“version” within 1.0-2.0, flagargument within 400-500, flagargument2 is value “Fe”).

Querying the version range is not an issue. How do you handle flagargument if it comes from another node? Do you want to query the output? Or do you want to traverse the graph and query something about the input node?

@ligerzero-ai
Contributor

I think we should keep it simple and have it query only the results of the specific node being searched for.

I don’t see the value in being able to traverse workflow node dependencies and downstream results with a query, simply because I don’t think it is a feature that will be used in practice. I could be wrong, but that’s my opinion.

I can count on one hand the times I have gone back to look at raw calculation results that I recall having produced before. On the other hand, attempting a calculation and having it return a previously computed result is a fundamentally useful user-facing feature that requires no extra user effort. Being able to query individual calculations, or even workflows, not so much.

E.g. I can imagine that a node calculating the bulk structure of iron containing 2 atoms has an enormous number of children/downstream nodes: surfaces, grain boundaries, dislocations, vacancies, and so on. How would you browse this tree as a human to find what you are looking for? Unless users rely on prebuilt macros that tag these results with specific human-readable labels, they are incomprehensible. How would you navigate this tree with queries to find what is useful?

This seems like a hard problem, and one that doesn’t need a solution yet, since it is of very little value. If people complain and want this, we can think again.

@ligerzero-ai
Contributor

So to answer your question: I would have thought that the cache stores raw output/input values, since that is what is required to identify whether something is identical when hashing.

@XzzX
Contributor Author

XzzX commented Dec 10, 2024

So to answer your question: I would have thought that the cache stores raw output/input values, since that is what is required to identify whether something is identical when hashing.

Yes, it does. OK, so searching is not so important for the moment and we will skip it. We will focus on caching simulation results.
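The caching behaviour being discussed, i.e. a computation returning a previously stored result instead of rerunning, can be sketched with a minimal hash-keyed cache. Everything here (`ResultCache`, `get_or_run`, using json + sha256 as the key) is illustrative, not pyiron's implementation.

```python
import hashlib
import json


class ResultCache:
    """Minimal sketch of a hash-keyed result cache (illustrative only)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(inputs):
        # Serialize the input dict deterministically, then hash it.
        return hashlib.sha256(
            json.dumps(inputs, sort_keys=True).encode()
        ).hexdigest()

    def get_or_run(self, inputs, run):
        k = self.key(inputs)
        if k not in self._store:
            # Cache miss: actually perform the (expensive) computation.
            self._store[k] = run(inputs)
        return self._store[k]


calls = []


def expensive_calculation(inputs):
    calls.append(1)  # count how often we really compute
    return inputs["x"] * 2


cache = ResultCache()
first = cache.get_or_run({"x": 3}, expensive_calculation)   # computes
second = cache.get_or_run({"x": 3}, expensive_calculation)  # cache hit
```

The second call with identical inputs returns the stored result without invoking the computation again, which is the "no extra user effort" behaviour described above.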

@liamhuber
Member

liamhuber commented Dec 11, 2024

So to answer your question, I would have thought that the cache would store raw output/input values. Since that is what is required to identify if something is identical, when hashing.

Yes, it does.

It does? I thought it stored raw input only if it was in a sufficiently primitive form, and otherwise stored a link to an upstream node (repeat until you finally get all sufficiently primitive output). Similarly, I thought it stored raw output only if it was in a sufficiently primitive form; otherwise it stored a link to serialized data and the knowledge of how to deserialize it. (That is why I am working on a hierarchical pickle-compliant serializer: we can exploit the hierarchy to deserialize only the interesting output variable, and not the whole node or even the graph that led to it, which is anyhow stored more abstractly in the hash database, sparser and more efficient, but sufficient for verifying the output is the one we're looking for.)

Or is this just a difference of semantics where you're considering the hashed inputs and file reference outputs as "raw" since they have some sort of string representation?

Edit: wait, the original statement has both "cache" and "hash", so maybe I am simply getting completely confused. I agree that the cache, as it is currently implemented, does hold the raw values.

@ligerzero-ai
Contributor

The node-result database should store the inputs and outputs, no? The hash is only for lookup. To me, that is the implementation that makes the most sense. If an input is not compliant with the supported storage types (even after traversing to the top node), we simply give up. Isn’t this how we are doing it?

In order to access the result storage, inputs must be objects that are hashable in some form. I think limiting the functionality to primitives is a really bad idea. I would prefer “if your object input can be hashed, you have access to the database”. So in effect you can recursively attempt to hash complex non-primitive inputs and then hash the entire input set in turn. Or is this a bad idea? That’s what I understand from the spec…
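The recursive-hashing idea could look roughly like this. This is an illustrative sketch, not pyiron's implementation: the `recursive_hash` name, its fallback to `vars(obj)` for arbitrary objects, and the container encodings are all assumptions.

```python
import hashlib
import json


def recursive_hash(obj):
    """Hash an input by recursing into containers and object attributes.

    Sketch of "recursively attempt to hash complex non-primitive inputs":
    primitives are json-encoded, containers are hashed element by element,
    arbitrary objects fall back to their attribute dict, and anything else
    is rejected (the "we simply give up" case).
    """
    if isinstance(obj, (int, float, str, bool, type(None))):
        payload = json.dumps(obj)
    elif isinstance(obj, (list, tuple)):
        payload = "[" + ",".join(recursive_hash(x) for x in obj) + "]"
    elif isinstance(obj, dict):
        payload = "{" + ",".join(
            f"{k}:{recursive_hash(v)}" for k, v in sorted(obj.items())
        ) + "}"
    elif hasattr(obj, "__dict__"):
        payload = recursive_hash(vars(obj))  # hash the object's attributes
    else:
        raise TypeError(f"cannot hash input of type {type(obj)!r}")
    return hashlib.sha256(payload.encode()).hexdigest()
```

With this, any object whose attributes bottom out in primitives gets a stable hash, so the whole input set can be hashed in turn by hashing the dict of per-input hashes.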

@XzzX
Contributor Author

XzzX commented Dec 11, 2024

How to hash?

Python's built-in hash function is pretty limited. The way @JNmpi did it, and what I am also following, is: create a dictionary of the inputs. If an input is a node, store the hash of that node; if it is a plain value, store the value. Currently I use json.dumps to convert the dictionary to a string; I plan to switch to the GenericStorage json backend for this, so that anything the storage backend can handle will be handled here as well. The string is then hashed. This is not the fastest approach, but it is pretty general and lets us reuse functionality implemented for GenericStorage.
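The scheme described above (node inputs contribute their own hash, plain values their value, then serialize and hash) might be sketched like this, with a minimal `Node` stand-in and `json.dumps` in place of the GenericStorage json backend:

```python
import hashlib
import json


class Node:
    """Minimal stand-in for a workflow node; not pyiron_workflow's class."""

    def __init__(self, inputs):
        self.inputs = inputs  # dict of label -> plain value or upstream Node


def node_hash(node):
    """Hash a node via a dictionary of its inputs.

    Upstream nodes are replaced by their own hash (recursively), plain
    values are kept as-is; the dict is serialized to a string and hashed.
    Sketch of the approach described in the comment above.
    """
    d = {}
    for label, value in node.inputs.items():
        if isinstance(value, Node):
            d[label] = node_hash(value)  # upstream node -> its hash
        else:
            d[label] = value  # plain value stored directly
    # json.dumps stands in for the GenericStorage json backend here.
    s = json.dumps(d, sort_keys=True)
    return hashlib.sha256(s.encode()).hexdigest()


bulk = Node({"element": "Fe", "n_atoms": 2})
relax = Node({"structure": bulk, "pressure": 0.0})
h = node_hash(relax)  # changes whenever any upstream input changes
```

Because upstream nodes enter only through their hash, a change anywhere in the input graph propagates into the downstream hash, which is what makes the hash usable as a cache key.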

What is stored?

I store the output of the node to disk using the hdf5 backend of GenericStorage; the database holds a link to the file. Apart from that, technically, only the hash is needed. However, since Joerg mentioned searching capabilities, we currently also store the json string used to create the hash (we can make this optional). This allows some basic querying through the database. Here we assume the string is not too large, i.e. it does not contain large numpy arrays, for example.
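A database record along those lines (hash as key, link to the hdf5 output file, plus the optional json string) could be modelled with a simple table. The schema, table name, and column names here are illustrative only, not the actual implementation:

```python
import sqlite3

# Hypothetical schema for the hash-based cache table (names are invented).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE node_cache (
           hash        TEXT PRIMARY KEY,  -- hash of the serialized input dict
           output_file TEXT,              -- link to the hdf5 file with the output
           input_json  TEXT               -- optional: json used to build the hash
       )"""
)
conn.execute(
    "INSERT INTO node_cache VALUES (?, ?, ?)",
    ("abc123", "outputs/abc123.h5", '{"element": "Fe", "n_atoms": 2}'),
)

# Cache lookup: hash -> location of the stored output.
row = conn.execute(
    "SELECT output_file FROM node_cache WHERE hash = ?", ("abc123",)
).fetchone()
```

Keeping `input_json` around is what enables the basic querying mentioned above (e.g. `WHERE input_json LIKE '%"element": "Fe"%'`), at the cost of assuming the string stays small.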

So, yes @ligerzero-ai, I am glad this is also what you got out of the spec.

@liamhuber I am still in the dark about what your storage solution will look like. Is it usable here? Do you implement the interface from the storage SPEC? Can it be merged with my PoC? Will it work for different backends, or is it bound to hdf5? From my side, I am curious how you store arbitrary datatypes and how you achieve partial loading; neither is currently possible with my PoC.

@pyiron/storage Do you have different ideas how to calculate the hash?
