Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open HDF5 file only once #1251

Closed
wants to merge 13 commits into from
Closed

Open HDF5 file only once #1251

wants to merge 13 commits into from

Conversation

jan-janssen
Copy link
Member

Example:

import h5py
from pyiron_atomistics import Project

pr = Project("test")
job = pr.create.job.Lammps("lmp")
job.structure = pr.create.structure.ase.bulk("Al", cubic=True)
job.run()

def print_attrs(name, obj):
    print(name)
    for key, val in obj.attrs.items():
        print("    %s: %s" % (key, val))

with h5py.File(job.project_hdf5.file_name, 'r') as f:
    f.visititems(print_attrs)

@jan-janssen jan-janssen marked this pull request as draft December 17, 2023 07:02
@jan-janssen
Copy link
Member Author

This pull request includes the changes from #1247 #1248 #1249 and #1250 so those pull requests should be merged first.

@jan-janssen
Copy link
Member Author

Reduce the number of times the HDF5 file is opened from 77 to 46 - which is a 40% reduction. From the 46 remaining calls 7 are handled via the new write_dict_to_hdf() call and the rest is caused by pyiron_atomistics so there is further potential to optimise this once this pull request is merged by applying the same changes to pyiron_atomistics.

@jan-janssen
Copy link
Member Author

======================================================================
ERROR: test_hdf_pandas (generic.test_datacontainer.TestDataContainer)
Values that implement to_hdf/from_hdf, should write themselves to the HDF file correctly.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/pyiron_base/pyiron_base/tests/generic/test_datacontainer.py", line 500, in test_hdf_pandas
    pl.to_hdf(hdf=self.hdf)
  File "/home/runner/work/pyiron_base/pyiron_base/pyiron_base/interfaces/has_hdf.py", line 223, in to_hdf
    self._to_hdf(hdf)
  File "/home/runner/work/pyiron_base/pyiron_base/pyiron_base/storage/datacontainer.py", line 808, in _to_hdf
    write_dict_to_hdf(hdf=hdf, data_dict=data_dict)
  File "/home/runner/work/pyiron_base/pyiron_base/pyiron_base/storage/helper_functions.py", line 79, in write_dict_to_hdf
    write_hdf5_with_json_support(
  File "/home/runner/work/pyiron_base/pyiron_base/pyiron_base/storage/helper_functions.py", line 61, in write_hdf5_with_json_support
    write_hdf5(
  File "/home/runner/work/pyiron_base/pyiron_base/pyiron_base/storage/helper_functions.py", line 41, in write_hdf5
    retry(
  File "/home/runner/work/pyiron_base/pyiron_base/pyiron_base/utils/error.py", line 145, in retry
    return func()
  File "/home/runner/work/pyiron_base/pyiron_base/pyiron_base/storage/helper_functions.py", line 42, in <lambda>
    lambda: h5io.write_hdf5(
  File "/usr/share/miniconda3/envs/my-env/lib/python3.9/site-packages/h5io/_h5io.py", line 138, in write_hdf5
    _create_pandas_dataset(fname, rootname, key, title, value)
  File "/usr/share/miniconda3/envs/my-env/lib/python3.9/site-packages/h5io/_h5io.py", line 60, in _create_pandas_dataset
    data.to_hdf(fname, rootpath)
  File "/usr/share/miniconda3/envs/my-env/lib/python3.9/site-packages/pandas/core/generic.py", line 2682, in to_hdf
    pytables.to_hdf(
  File "/usr/share/miniconda3/envs/my-env/lib/python3.9/site-packages/pandas/io/pytables.py", line 307, in to_hdf
    f(path_or_buf)
  File "/usr/share/miniconda3/envs/my-env/lib/python3.9/site-packages/pandas/io/pytables.py", line 287, in <lambda>
    f = lambda store: store.put(
AttributeError: 'File' object has no attribute 'put'

======================================================================
FAIL: test_local_defragment_storage (project.test_maintenance.TestMaintenance)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/pyiron_base/pyiron_base/tests/project/test_maintenance.py", line 43, in test_local_defragment_storage
    self._assert_hdf_rewrite()
  File "/home/runner/work/pyiron_base/pyiron_base/tests/project/test_maintenance.py", line 30, in _assert_hdf_rewrite
    self.assertLess(job_hdf.file_size(job_hdf), self.initial_toy_1_hdf_file_size)
AssertionError: 188394 not less than 116548

@jan-janssen jan-janssen added the format_black reformat the code using the black standard label Dec 17, 2023
@pmrv
Copy link
Contributor

pmrv commented Dec 18, 2023

@jan-janssen and I just discussed a slightly more formalized version of this that would add a new HasStorageDict (or whatever name) that would play well with HasHDF and DummyHDFio (here) along these lines

class HasStorageDict(ABC):
  @abstractmethod
  def _to_storage_dict(self):
    pass

  def type_storage_dict(self):
     return {
            "NAME": self.__name__,
            "TYPE": str(type(self)),
      }

  def to_storage_dict(self):
    data_dict = self.type_storage_dict()
    data_dict.update(self._to_storage_dict())
    return data_dict

  @classmethod
  @abstractmethod
  def _from_storage_dict(cls, data_dict):
    pass

and expanding HasHDF

class HasHDF(HasStorageDict):
  ...

  def _to_storage_dict(self):
    hdf = DummyHDFio(...)
    self.to_hdf(hdf)
    return hdf.to_dict()
  @classmethod
  def _from_storage_dict(cls, data_dict):
    hdf = DummyHDFio(..., data_dict)
    obj = cls(**cls._from_hdf_args(hdf))
    obj.from_hdf(hdf)
    return obj

Similarly every HasStorageDict could have a trivial to_hdf

class HasStorageDict(ABC):
  ...

  def to_hdf(self, hdf):
    write_dict_to_hdf(hdf, self._to_storage_dict())

where write_dict_to_hdf comes also from this PR. (A mirroring read_dict_from_hdf seems to be still missing)

This sort of setup would have the advantage that later we could plug in __getstate__ and __setstate__ into HasStorageDict very straight forwardly and leave ourselves still somewhat open to other storage backends in the future.

@jan-janssen jan-janssen added format_black reformat the code using the black standard and removed format_black reformat the code using the black standard labels Dec 20, 2023
@jan-janssen jan-janssen added format_black reformat the code using the black standard and removed format_black reformat the code using the black standard labels Dec 20, 2023
@jan-janssen jan-janssen deleted the combo branch December 21, 2023 10:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
format_black reformat the code using the black standard
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants