MACE-OFF dataset #332
This is ready for review. Preprocessing is really slow: it takes about 40 minutes. The dataset comes as an XYZ file, which I am processing with ase, and I do not know how to speed it up.
    import re
    import tarfile

    from moleculekit.periodictable import periodictable

    # Matches the "energy=<value>" entry on the comment line of each frame.
    ENERGY_RE = re.compile(r"energy=(\S+)")


    def parse_xyz(xyz_file):
        """Yield (energy, numbers, positions, forces) for every frame in the archive."""
        with tarfile.open(xyz_file, "r:gz") as tar:
            for member in tar.getmembers():
                f = tar.extractfile(member)
                if f is None:  # not a regular file (e.g. a directory entry)
                    continue
                n_atoms = None  # None means the next line starts a new frame
                counter = 0
                for line in f:
                    line = line.decode("utf-8").strip()
                    if n_atoms is None:
                        # First line of a frame: the atom count.
                        n_atoms = int(line)
                        positions = []
                        numbers = []
                        forces = []
                        energy = None
                        counter = 1
                        continue
                    if counter == 1:
                        # Second line of a frame: the comment line holding the energy.
                        energy = float(ENERGY_RE.search(line).group(1))
                        counter = 2
                        continue
                    # Atom line: element, position, forces; the last three columns are unused.
                    el, x, y, z, fx, fy, fz, _, _, _ = line.split()
                    numbers.append(periodictable[el].number)
                    positions.append([float(x), float(y), float(z)])
                    forces.append([float(fx), float(fy), float(fz)])
                    counter += 1
                    if counter == n_atoms + 2:
                        # Frame complete: reset and hand it to the caller.
                        n_atoms = None
                        yield energy, numbers, positions, forces

I wrote an XYZ parser for the MACE dataset. You can use it with:

    gen = parse_xyz("./train_large_neut_no_bad_clean.tar.gz")
    x = next(gen)
    x = next(gen)

The first call takes a little while because of the extraction; after that it is very fast (around 60 μs per call for me).
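To try the parser without downloading the full dataset, a tiny synthetic archive in the same layout can help. The sketch below is only an assumption about that layout, inferred from what `parse_xyz` reads: an atom-count line, a comment line containing `energy=...`, and ten whitespace-separated columns per atom. The names `FRAMES`, `frames_to_xyz`, `write_test_archive` and `synthetic.xyz`, and all numeric values, are made up for illustration.

```python
import io
import os
import tarfile
import tempfile

# Two made-up frames: (energy, [(element, position, forces), ...]).
# The numbers are illustrative only and carry no physical meaning.
FRAMES = [
    (-1.5, [("O", (0.0, 0.0, 0.0), (0.1, 0.0, 0.0)),
            ("H", (0.96, 0.0, 0.0), (-0.1, 0.0, 0.0))]),
    (-2.5, [("C", (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)),
            ("O", (1.2, 0.0, 0.0), (0.0, 0.2, 0.0)),
            ("O", (-1.2, 0.0, 0.0), (0.0, -0.2, 0.0))]),
]


def frames_to_xyz(frames):
    """Render frames as text: count line, comment line, then one line per atom."""
    lines = []
    for energy, atoms in frames:
        lines.append(str(len(atoms)))
        lines.append(f"energy={energy} source=synthetic")
        for el, pos, frc in atoms:
            # Ten columns: element, x y z, fx fy fz, and three unused placeholders.
            cols = [el, *map(str, pos), *map(str, frc), "0", "0", "0"]
            lines.append(" ".join(cols))
    return "\n".join(lines) + "\n"


def write_test_archive(path, frames=FRAMES, member_name="synthetic.xyz"):
    """Write the frames into a gzip-compressed tar archive with a single member."""
    data = frames_to_xyz(frames).encode("utf-8")
    info = tarfile.TarInfo(name=member_name)
    info.size = len(data)
    with tarfile.open(path, "w:gz") as tar:
        tar.addfile(info, io.BytesIO(data))
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        archive = write_test_archive(os.path.join(tmp, "synthetic.tar.gz"))
        with tarfile.open(archive, "r:gz") as tar:
            print(tar.extractfile("synthetic.xyz").read().decode("utf-8"))
```

Passing the resulting path to `parse_xyz` should yield two frames, with two and three atoms respectively, provided the real files follow the layout assumed here.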
It takes about 1 minute in total to parse the whole file, excluding the initial extraction cost of roughly 10-20 s.
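Timings like these can be reproduced by measuring the first `next()` call separately from the rest, since the first call also pays the extraction cost. This is a minimal sketch, not the measurement actually used in this thread; `time_generator` and `fake_frames` are hypothetical names, and the stand-in generator only imitates a slow first item.

```python
import itertools
import time


def time_generator(gen, max_items=None):
    """Return (first_call_seconds, mean_seconds_per_later_call, items_consumed)."""
    it = iter(gen)
    start = time.perf_counter()
    try:
        next(it)
    except StopIteration:
        return 0.0, 0.0, 0
    first = time.perf_counter() - start

    # Optionally cap how many further items are consumed.
    remaining = it if max_items is None else itertools.islice(it, max(max_items - 1, 0))
    count = 1
    start = time.perf_counter()
    for _ in remaining:
        count += 1
    rest = time.perf_counter() - start
    mean_rest = rest / (count - 1) if count > 1 else 0.0
    return first, mean_rest, count


def fake_frames(n):
    """Stand-in generator: a slow first item (like archive extraction), then fast ones."""
    time.sleep(0.05)
    for i in range(n):
        yield i


if __name__ == "__main__":
    first, mean_rest, n = time_generator(fake_frames(1000))
    print(f"first call: {first:.3f} s, later calls: {mean_rest * 1e6:.1f} us each, {n} items")
```

With the real data, `fake_frames(1000)` would be replaced by `parse_xyz(...)`, and `max_items` keeps a quick check from walking the whole archive.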
Works great @stefdoerr, thanks. Please review again!
I added a Dataset class for the dataset used in the paper "MACE-OFF23: Transferable Machine Learning Force Fields for Organic Molecules" (https://arxiv.org/pdf/2312.15211).