I'm relatively new to Python and programming and just use it for the analysis of simulation data. I have a directory "result_1/" with over 150,000 CSV files of simulation data that I want to concatenate into one pandas DataFrame. To avoid problems with readdir() only reading 32K directory entries at a time, I prepared "files.csv", which lists all the files in the directory.
("sim", "det", and "run" are pieces of information I read from the filenames and insert as Series into the dataFrame. For better overlook, I took their definition out of the concat.)
My problem is as follows: The program takes too much time to run and I would like to use multiprocessing/-threading to speed up the for-loop, but as I never used mp/mt before, I don't even know if or how it may be used here.
Thank you in advance and have a great day!
import numpy as np
import pandas as pd
import os
import multiprocessing as mp
df = pd.DataFrame()
path = 'result_1/'
list = pd.read_csv('files.csv', encoding='utf_16_le', names=['f'])['f'].values.tolist()
for file in list:
    # Read one file, keeping only the x and dos columns
    dftemp = pd.read_csv(os.path.join(path, file), skiprows=8, names=['x', 'y', 'z', 'dos'], sep=',').drop(['y', 'z'], axis=1)
    # Extract sim, det and run from the filename and repeat them for all 300 rows
    sim = pd.Series(int(file.split('Nr')[1].split('_')[0]) * np.ones((300,), dtype=int))
    det = pd.Series(int(file.split('Nr')[0]) * np.ones((300,), dtype=int))
    run = pd.Series(int(file[-8:-4]) * np.ones((300,), dtype=int))
    dftemp = pd.concat([sim, det, run, dftemp], axis=1)
    df = pd.concat([df, dftemp], axis=0)
df.rename({0:'sim', 1:'det', 2:'run', 3:'x', 4:'dos'}, axis=1).to_csv(r'df.csv')
The CSV files are named like this: "193Nr6_Run_0038.csv" (for example), and their contents look like this:
#(8 lines of things I don't need.)
0, 0, 0, 4.621046656438921e-09
1, 0, 0, 4.600856584602298e-09
(... 300 lines of data [x, y, z, dose])
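For reference, this is how I pull the three numbers out of such a filename (the same slicing as in the loop above, just spelled out):
file = "193Nr6_Run_0038.csv"
det = int(file.split('Nr')[0])                # 193
sim = int(file.split('Nr')[1].split('_')[0])  # 6
run = int(file[-8:-4])                        # 38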
CodePudding user response:
Processing DataFrames in parallel can be difficult due to CPU and RAM limitations. I don't know the specs of your hardware or the details of your DataFrames. However, I would use multiprocessing to "parse/make" the DataFrames and then concatenate them afterwards. Here is an example:
import numpy as np
import pandas as pd
import os
from multiprocessing import Pool
path = 'result_1/'
list_of_files = pd.read_csv('files.csv', encoding='utf_16_le', names=['f'])['f'].values.tolist()
# Make a function to replace the for-loop:
def my_custom_func(file):
    dftemp = pd.read_csv(os.path.join(path, file), skiprows=8, names=['x', 'y', 'z', 'dos'], sep=',').drop(['y', 'z'], axis=1)
    sim = pd.Series(int(file.split('Nr')[1].split('_')[0]) * np.ones((300,), dtype=int))
    det = pd.Series(int(file.split('Nr')[0]) * np.ones((300,), dtype=int))
    run = pd.Series(int(file[-8:-4]) * np.ones((300,), dtype=int))
    return pd.concat([sim, det, run, dftemp], axis=1)

# Use multiprocessing to process several files at once
with Pool(8) as p:  # 8 processes simultaneously; avoid using more processes than your CPU has cores
    dataframes = p.map(my_custom_func, list_of_files)

# Finally, concatenate them all
df = pd.concat(dataframes)
df.rename({0: 'sim', 1: 'det', 2: 'run', 3: 'x', 4: 'dos'}, axis=1).to_csv(r'df.csv')
Have a look at multiprocessing.Pool() for more info.
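Two practical notes on top of this (a sketch, assuming the same my_custom_func, list_of_files and rename step as above): if the script runs where the "spawn" start method is the default (Windows, and macOS on recent Python versions), the Pool must be created under an if __name__ == '__main__': guard, and with roughly 150,000 tiny files it can pay off to hand the file names to the workers in batches via the chunksize argument instead of one task per file:
import os
import pandas as pd
from multiprocessing import Pool

if __name__ == '__main__':
    # One worker per CPU core; chunksize batches the file names per task,
    # which reduces inter-process overhead when there are very many small files.
    # The value 500 is just an example, tune it for your machine.
    with Pool(processes=os.cpu_count()) as p:
        results = p.imap_unordered(my_custom_func, list_of_files, chunksize=500)
        df = pd.concat(results, ignore_index=True)
    df.rename({0: 'sim', 1: 'det', 2: 'run', 3: 'x', 4: 'dos'}, axis=1).to_csv('df.csv')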