I usually use Python for research, but this is my first time handling a large dataset (over a hundred million lines spread across multiple files) on an old but capable workstation (Xeon E5-2637 v4 CPU, Quadro K420 GPU).
Any help with speeding up the algorithm below would be greatly appreciated. I have been looking at getting more out of the hardware and at replacing my for loop with groupby, but to no avail. I have also looked through previous questions, but I believe what I need is more elementary.
The data format is as follows (the same format for all files).
C:/../data1.csv
--
col1 col2 col3
parent abcde NaN
child d3d a1a
child s2s f4f
parent fghij NaN
child g5g h6h
child j7j k8k
My original code:
import pandas as pd

# list of file locations
filelist = {'files': ['C:/../data1.csv', 'C:/../data2.csv', 'C:/../data3.csv']}
filelist_df = pd.DataFrame(data=filelist)
filelist_df = filelist_df["files"].str.strip("[]")

# data transformation
column_names = ['1', '2', '3', '4']
temp_parent = []
for i in range(3):
    new_df = pd.DataFrame(columns=column_names)
    data_df = pd.read_csv(filelist_df[i], skiprows=1, names=column_names)
    for j in range(len(data_df)):
        if data_df['1'][j] == 'parent':
            # remember this parent's col2 value
            temp_parent = data_df['2'][j]
        else:
            # copy the remembered parent value onto the child row
            data_df['4'][j] = temp_parent
            temp_row = data_df.loc[j, :]
            # note: DataFrame.append is deprecated in recent pandas
            new_df = new_df.append(temp_row, ignore_index=True)
    new_df.to_csv('C:/../new%d.csv' % i, index=False, header=False)
    del new_df, data_df, temp_parent, temp_row
Output (just for data1.csv):
C:/../new0.csv
--
child d3d a1a abcde
child s2s f4f abcde
child g5g h6h fghij
child j7j k8k fghij
CodePudding user response:
If I understand you correctly, you want to create a new column holding the value from the parent row's col2 column:
mask = df.col1.eq("parent")        # True on parent rows
df["col4"] = df.loc[mask, "col2"]  # parent's col2 value; NaN on child rows
df["col4"] = df["col4"].ffill()    # forward-fill each parent value down to its children
print(df[~mask])                   # keep only the child rows
Prints:
col1 col2 col3 col4
1 child d3d a1a abcde
2 child s2s f4f abcde
4 child g5g h6h fghij
5 child j7j k8k fghij
Input dataframe:
col1 col2 col3
0 parent abcde NaN
1 child d3d a1a
2 child s2s f4f
3 parent fghij NaN
4 child g5g h6h
5 child j7j k8k
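Applied across all of the files, these vectorized steps can replace the inner row loop entirely. A minimal sketch, assuming the question's file paths and that each file parses with read_csv exactly as in the original code:
files = ["C:/../data1.csv", "C:/../data2.csv", "C:/../data3.csv"]
column_names = ["1", "2", "3", "4"]

for i, path in enumerate(files):
    df = pd.read_csv(path, skiprows=1, names=column_names)
    mask = df["1"].eq("parent")
    # take col "2" on parent rows, forward-fill it down onto the child rows
    df["4"] = df["2"].where(mask).ffill()
    # keep only the child rows, as in the question's expected output
    df[~mask].to_csv("C:/../new%d.csv" % i, index=False, header=False)
Because where/ffill operate on whole columns at once, this avoids the per-row DataFrame.append calls, which copy the growing frame on every iteration and dominate the runtime on files with millions of lines.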