Home > Software design >  Improve the performance of my python code (I'm using Pandas)
Improve the performance of my python code (I'm using Pandas)

Time:01-12

I'm doing a python code for data analysis. I would like to mark the lines, in a new column, that have the same value in EMP, RAZAO and ATRIB columns and add the values in MONTCALC is zero. For exemple:

Example of datas

In this image the lines marked with color are subgroup and if you add the values of MONTCALC column the result is 0.

My code:

conciliation_df_temp = conciliation_df.copy()
doc_clear = 1

for i in conciliation_df_temp.index:
  if conciliation_df_temp.loc[i,'DOC_COMP'] == "":
    company = conciliation_df_temp.loc[i,'EMP']
    gl_account = conciliation_df_temp.loc[i,'RAZAO']
    assignment = conciliation_df_temp.loc[i,'ATRIB']
    df_temp = conciliation_df_temp.loc[(cconciliation_df_temp['EMP'] == company) & (conciliation_df_temp['RAZAO'] == gl_account) & (conciliation_df_temp['ATRIB'] == assignment)]

    if round(df_temp['MONTCALC'].sum(),2) == 0:
      conciliation_df_temp.loc[(conciliation_df_temp['EMP'] == company) & (conciliation_df_temp['RAZAO'] == gl_account) & (conciliation_df_temp['ATRIB'] == assignment),'DOC_COMP'] = doc_clear
      doc_clear  = 1

The performance with few lines (10,000) is good execute less than 1 minute. In the 1 minute also has read a text file, file handling and convertion to dataframe. But if I put a text file with more than 1 million lines the script does't execute, I wait 5 hours with out return.

What do I do to improve performance this code?

Regards!!

Sorry my English

I tried delete lines in dataFrame to decrease size dataFrame to search be faster, but the execution was slower.

CodePudding user response:

It seems like you could just check if the sum of the group is zero:

import pandas as pd

df = pd.DataFrame([
  [3000,1131500040,8701731701,-156002.08],
  [3000,1131500040,8701731701, 156002.08],
  [3000,1131500040,"EA-17012.2.22", -3990],
  [3000,1131500040,"EA-17012.2.22", 400],
  [3000,1131500040,"000100000103", -35822.86],
  [3000,1131500040,"000100000103", 35822.86],
  [3000,1131500040,"000100000103", -35822.86],
  [3000,1131500040,"000100000103", 35822.86]
], columns=['EMP','RAZAO','ATRIB','MONTCALC']
)

df['zero'] = df.groupby(['EMP','RAZAO','ATRIB'])['MONTCALC'].transform(lambda x: sum(x)==0)

print(df)

Output

    EMP       RAZAO          ATRIB   MONTCALC   zero
0  3000  1131500040     8701731701 -156002.08   True
1  3000  1131500040     8701731701  156002.08   True
2  3000  1131500040  EA-17012.2.22   -3990.00  False
3  3000  1131500040  EA-17012.2.22     400.00  False
4  3000  1131500040   000100000103  -35822.86   True
5  3000  1131500040   000100000103   35822.86   True
6  3000  1131500040   000100000103  -35822.86   True
7  3000  1131500040   000100000103   35822.86   True
  • Related