So my question is how I should save a large amount of simulation data to a file using Python (or append new data rows to an existing file).
Let's say I have NN=1000 particles, and I want to save the position and velocity data of each particle (x y z, vx vy vz). The data is in the format [x1,y1,z1,vx1,vy1,vz1, x2,y2,z2,vx2,vy2,vz2, ...] and so on.
The simulation itself works well, but I believe the method I use for saving and storing this data is not really optimal.
Pseudo code similar to my code
import numpy as np

T_max = 1000 # for example
dt = 0.1 # time step
T = 0 # current time
iterations = int(T_max/dt) # number of iterations we are doing
NN = 1000 # Number of particles
ZZ = np.zeros( (iterations, 2+NN*6) ) # Here I generate whole data matrix at the beginning.
# ^ might not be the best idea as the system needs to keep everything in memory for the whole time
# So I guess saving could be done in chunks?
ZZ[0][0], ZZ[0][1] = T , dt
# ZZ[0][2:] = initialize_system(NN=NN) # so lets initialize the system.
# However, for this post I do this differently due to simplicity. See below
ZZ[0][2:] = np.random.uniform(-100,100,NN*6)
i = 0
while i < iterations - 1:
    T += dt
    ZZ[i+1][0], ZZ[i+1][1] = T, dt
    #ZZ[i+1][2:] = rk4(EOM_function, posvel=ZZ[i][2:])
    # ^ Using this I would calculate new positions based on previous ones.
    ZZ[i+1][2:] = np.random.uniform(-100,100,NN*6) #This is just for example here.
    i += 1
# Now the simulation data is basically done, so one would need to save
# This one feels slow, as it takes 181s to save and the file is 1046246KB
np.savetxt('test1.txt', ZZ)
#other method with a bit less accuracy as I don't need to have all decimals saved
np.savetxt('test2.txt', ZZ, fmt='%1.6f') # Takes 125s and size is 426698KB
# Both of the above are kinda slow so I also tried to save to npy format
np.save('test.npy', ZZ) # It took 8.9s and size 164118KB
So this np.save() method seems to be fast, but I read somewhere that I cannot append data to it. So this would not work if I keep saving the data in parts while calculating new positions.
So back to my question: how should/could I save the data efficiently (fast and memory friendly)? I keep running into memory issues when NN and T_max get larger, because with this method I keep the whole ZZ in memory the whole time. So I guess I should calculate ZZ in parts, i.e. in iterations/10 parts, but then I would need to append each part to an existing file, and the tests I have made of that felt slow. Any suggestions?
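To make the "in parts" idea concrete, here is a rough sketch of the kind of chunked append I mean (not my actual code; random data stands in for the rk4 output, and the file name is just a placeholder):
import numpy as np

NN = 1000
iterations = 10000                    # int(T_max/dt) from above
n_parts = 10                          # i.e. iterations/10 parts
rows_per_part = iterations // n_parts

with open('test_parts.txt', 'a') as f:
    for part in range(n_parts):
        # stand-in for the rows that rk4 would actually produce
        ZZ_part = np.random.uniform(-100, 100, (rows_per_part, 2 + NN*6))
        np.savetxt(f, ZZ_part, fmt='%1.6f')  # appending like this is the step that felt slow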
EDIT: feel free to ask more specific questions, as I feel like I may have forgotten to explain something.
CodePudding user response:
That highly depends on what you intend to use the output for. If it's stored for further calculations, .npy or some other binary format is always the way to go: it is faster, takes less space, and doesn't lose precision between loads and saves, unlike serializing to a human-readable format. If you need the output to be readable, you might as well just write it row by row to a CSV file or something similar.
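If you go the readable route, a minimal sketch of the row-by-row idea could look like this (the file name, the number of steps and the rounding are just placeholders, not part of your code):
import csv
import numpy as np

# One row per time step: [T, dt, x1, y1, z1, vx1, vy1, vz1, ...]
NN = 1000
with open('test.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for step in range(10):                            # stand-in for the simulation loop
        row = np.random.uniform(-100, 100, 2 + NN*6)  # stand-in for a computed row
        writer.writerow(np.round(row, 6))             # written out immediately, nothing accumulates in memory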
If you want to do it with binary, h5py allows you to extend a dataset after saving and append more stuff to it.
import numpy as np
import h5py
T_max = 10**5 # for example
dt = 0.1 # time step
T = 0 # current time
iterations = int(T_max/dt) # number of iterations we are doing
NN = 1000 # Number of particles
chunk_size = 10**5
ZZ = np.zeros( (chunk_size, 2+NN*6) )
ZZ[0][0], ZZ[0][1] = T , dt
# ZZ[0][2:] = initialize_system(NN=NN) # so lets initialize the system.
# However, for this post I do this differently due to simplicity. See below
ZZ[0][2:] = np.random.uniform(-100,100,NN*6)
with h5py.File("test.h5", "a") as f:
    dset = f.create_dataset('ZZ', (0, 2+NN*6), maxshape=(None, 2+NN*6), dtype='f8', chunks=(chunk_size, 2+NN+6))
    for chunk in range(0, iterations, chunk_size):
        for i in range(chunk_size - 1):
            T += dt
            ZZ[i+1][0], ZZ[i+1][1] = T, dt
            #ZZ[i+1][2:] = rk4(EOM_function, posvel=ZZ[i][2:])
            # ^ Using this I would calculate new positions based on previous ones.
            ZZ[i+1][2:] = np.random.uniform(-100,100,NN*6) #This is just for example here.
        # Expand the file here to allow for more data.
        dset.resize(dset.shape[0] + chunk_size, axis=0)
        dset[chunk : chunk + chunk_size] = ZZ
        # update and initialize next chunk. the next chunk's first row should be the last row of the previous chunk iteration
        T += dt
        ZZ[0][0], ZZ[0][1] = T, dt
        #ZZ[0][2:] = rk4(EOM_function, posvel=ZZ[-1][2:])
        # ^ Using this I would calculate new positions based on previous ones.
        ZZ[0][2:] = np.random.uniform(-100,100,NN*6) #This is just for example here.
    print(dset.shape)
This takes 70 seconds on the save step on my computer, generating a 45GB file, for a dataset 100 times larger than the one in your original code.
The above code is more general in case you are streaming your data and don't know your final size. If you know it from the start, you can replace the initial create_dataset with
dset = f.create_dataset('ZZ', (iterations, 2+NN*6), dtype='f8')
and remove the dset.resize(dset.shape[0] + chunk_size, axis=0) call.
You'll probably also want to read it back in chunks afterwards for other processing, in which case you can follow the docs here: https://docs.h5py.org/en/latest/high/dataset.html#reading-writing-data
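For example, something along these lines reads it back one slice at a time, so only the requested rows are loaded into memory:
import h5py

chunk_size = 10**5
with h5py.File("test.h5", "r") as f:
    dset = f['ZZ']
    for start in range(0, dset.shape[0], chunk_size):
        block = dset[start:start + chunk_size]  # numpy array containing only this slice
        # ... do your processing on `block` here ...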