Improve the performance of my python code (I'm using Pandas)-CodePudding

I'm doing a python code for data analysis. I would like to mark the lines, in a new column, that have the same value in EMP, RAZAO and ATRIB columns and add the values in MONTCALC is zero. For exemple:

Example of datas

In this image the lines marked with color are subgroup and if you add the values of MONTCALC column the result is 0.

My code:

conciliation_df_temp = conciliation_df.copy()
doc_clear = 1

for i in conciliation_df_temp.index:
  if conciliation_df_temp.loc[i,'DOC_COMP'] == "":
    company = conciliation_df_temp.loc[i,'EMP']
    gl_account = conciliation_df_temp.loc[i,'RAZAO']
    assignment = conciliation_df_temp.loc[i,'ATRIB']
    df_temp = conciliation_df_temp.loc[(cconciliation_df_temp['EMP'] == company) & (conciliation_df_temp['RAZAO'] == gl_account) & (conciliation_df_temp['ATRIB'] == assignment)]

    if round(df_temp['MONTCALC'].sum(),2) == 0:
      conciliation_df_temp.loc[(conciliation_df_temp['EMP'] == company) & (conciliation_df_temp['RAZAO'] == gl_account) & (conciliation_df_temp['ATRIB'] == assignment),'DOC_COMP'] = doc_clear
      doc_clear  = 1

The performance with few lines (10,000) is good execute less than 1 minute. In the 1 minute also has read a text file, file handling and convertion to dataframe. But if I put a text file with more than 1 million lines the script does't execute, I wait 5 hours with out return.

What do I do to improve performance this code?

Regards!!

Sorry my English

I tried delete lines in dataFrame to decrease size dataFrame to search be faster, but the execution was slower.

CodePudding user response：

It seems like you could just check if the sum of the group is zero:

import pandas as pd

df = pd.DataFrame([
  [3000,1131500040,8701731701,-156002.08],
  [3000,1131500040,8701731701, 156002.08],
  [3000,1131500040,"EA-17012.2.22", -3990],
  [3000,1131500040,"EA-17012.2.22", 400],
  [3000,1131500040,"000100000103", -35822.86],
  [3000,1131500040,"000100000103", 35822.86],
  [3000,1131500040,"000100000103", -35822.86],
  [3000,1131500040,"000100000103", 35822.86]
], columns=['EMP','RAZAO','ATRIB','MONTCALC']
)

df['zero'] = df.groupby(['EMP','RAZAO','ATRIB'])['MONTCALC'].transform(lambda x: sum(x)==0)

print(df)

Output

    EMP       RAZAO          ATRIB   MONTCALC   zero
0  3000  1131500040     8701731701 -156002.08   True
1  3000  1131500040     8701731701  156002.08   True
2  3000  1131500040  EA-17012.2.22   -3990.00  False
3  3000  1131500040  EA-17012.2.22     400.00  False
4  3000  1131500040   000100000103  -35822.86   True
5  3000  1131500040   000100000103   35822.86   True
6  3000  1131500040   000100000103  -35822.86   True
7  3000  1131500040   000100000103   35822.86   True