I usually use Python for research, but this is my first time handling a large dataset (over a hundred million lines spread across multiple files) on an old but capable workstation (Xeon E5-2637 v4 CPU, Quadro K420 GPU).
Any help with speeding up the algorithm below would be greatly appreciated. I have been looking at getting more out of the hardware and at replacing my for loop with groupby, but to no avail. I have also looked through previous questions, but I believe what I need is more elementary.
The data format is as follows (the same format for all files).
C:/../data1.csv
--
col1 col2 col3
parent abcde NaN
child d3d a1a
child s2s f4f
parent fghij NaN
child g5g h6h
child j7j k8k
My original code:
import pandas as pd

# list of file locations
filelist = {'files': ['C:/../data1.csv', 'C:/../data2.csv', 'C:/../data3.csv']}
filelist_df = pd.DataFrame(data=filelist)
filelist_df = filelist_df["files"].str.strip("[]")

# data transformation
column_names = ['1', '2', '3', '4']
temp_parent = []
for i in range(3):
    new_df = pd.DataFrame(columns=column_names)
    data_df = pd.read_csv(filelist_df[i], skiprows=1, names=column_names)
    for j in range(len(data_df)):
        if data_df['1'][j] == 'parent':
            # remember this parent's col2 value
            temp_parent = data_df['2'][j]
        else:
            # copy the remembered parent value onto the child row
            data_df['4'][j] = temp_parent
            temp_row = data_df.loc[j, :]
            # note: DataFrame.append is deprecated in recent pandas
            new_df = new_df.append(temp_row, ignore_index=True)
    new_df.to_csv('C:/../new%d.csv' % i, index=False, header=False)
    del new_df, data_df, temp_parent, temp_row
Output (just for data1.csv):
C:/../new0.csv
--
child d3d a1a abcde
child s2s f4f abcde
child g5g h6h fghij
child j7j k8k fghij
CodePudding user response:
If I understand you correctly, you want to create a new column holding the value from the parent row's col2 column:
mask = df.col1.eq("parent")        # True on parent rows
df["col4"] = df.loc[mask, "col2"]  # parent's col2 value; NaN on child rows
df["col4"] = df["col4"].ffill()    # forward-fill each parent value down to its children
print(df[~mask])                   # keep only the child rows
Prints:
col1 col2 col3 col4
1 child d3d a1a abcde
2 child s2s f4f abcde
4 child g5g h6h fghij
5 child j7j k8k fghij
Input dataframe:
col1 col2 col3
0 parent abcde NaN
1 child d3d a1a
2 child s2s f4f
3 parent fghij NaN
4 child g5g h6h
5 child j7j k8k
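Applied across all of the files, these vectorized steps can replace the inner row loop entirely. A minimal sketch, assuming the question's file paths and that each file parses with read_csv exactly as in the original code:
files = ["C:/../data1.csv", "C:/../data2.csv", "C:/../data3.csv"]
column_names = ["1", "2", "3", "4"]

for i, path in enumerate(files):
    df = pd.read_csv(path, skiprows=1, names=column_names)
    mask = df["1"].eq("parent")
    # take col "2" on parent rows, forward-fill it down onto the child rows
    df["4"] = df["2"].where(mask).ffill()
    # keep only the child rows, as in the question's expected output
    df[~mask].to_csv("C:/../new%d.csv" % i, index=False, header=False)
Because where/ffill operate on whole columns at once, this avoids the per-row DataFrame.append calls, which copy the growing frame on every iteration and dominate the runtime on files with millions of lines.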