I'm doing a python code for data analysis. I would like to mark the lines, in a new column, that have the same value in EMP, RAZAO and ATRIB columns and add the values in MONTCALC is zero. For exemple:
In this image the lines marked with color are subgroup and if you add the values of MONTCALC column the result is 0.
My code:
conciliation_df_temp = conciliation_df.copy()
doc_clear = 1
for i in conciliation_df_temp.index:
if conciliation_df_temp.loc[i,'DOC_COMP'] == "":
company = conciliation_df_temp.loc[i,'EMP']
gl_account = conciliation_df_temp.loc[i,'RAZAO']
assignment = conciliation_df_temp.loc[i,'ATRIB']
df_temp = conciliation_df_temp.loc[(cconciliation_df_temp['EMP'] == company) & (conciliation_df_temp['RAZAO'] == gl_account) & (conciliation_df_temp['ATRIB'] == assignment)]
if round(df_temp['MONTCALC'].sum(),2) == 0:
conciliation_df_temp.loc[(conciliation_df_temp['EMP'] == company) & (conciliation_df_temp['RAZAO'] == gl_account) & (conciliation_df_temp['ATRIB'] == assignment),'DOC_COMP'] = doc_clear
doc_clear = 1
The performance with few lines (10,000) is good execute less than 1 minute. In the 1 minute also has read a text file, file handling and convertion to dataframe. But if I put a text file with more than 1 million lines the script does't execute, I wait 5 hours with out return.
What do I do to improve performance this code?
Regards!!
Sorry my English
I tried delete lines in dataFrame to decrease size dataFrame to search be faster, but the execution was slower.
CodePudding user response:
It seems like you could just check if the sum of the group is zero:
import pandas as pd
df = pd.DataFrame([
[3000,1131500040,8701731701,-156002.08],
[3000,1131500040,8701731701, 156002.08],
[3000,1131500040,"EA-17012.2.22", -3990],
[3000,1131500040,"EA-17012.2.22", 400],
[3000,1131500040,"000100000103", -35822.86],
[3000,1131500040,"000100000103", 35822.86],
[3000,1131500040,"000100000103", -35822.86],
[3000,1131500040,"000100000103", 35822.86]
], columns=['EMP','RAZAO','ATRIB','MONTCALC']
)
df['zero'] = df.groupby(['EMP','RAZAO','ATRIB'])['MONTCALC'].transform(lambda x: sum(x)==0)
print(df)
Output
EMP RAZAO ATRIB MONTCALC zero
0 3000 1131500040 8701731701 -156002.08 True
1 3000 1131500040 8701731701 156002.08 True
2 3000 1131500040 EA-17012.2.22 -3990.00 False
3 3000 1131500040 EA-17012.2.22 400.00 False
4 3000 1131500040 000100000103 -35822.86 True
5 3000 1131500040 000100000103 35822.86 True
6 3000 1131500040 000100000103 -35822.86 True
7 3000 1131500040 000100000103 35822.86 True