Home > OS >  Is there a better way to replace the "for" loop in python?
Is there a better way to replace the "for" loop in python?

Time:12-14

Can someone help me?

I have a huge dataframe to work with (90 thousand rows) and I need to apply this logic, but I can only think of solving the problem using a 'for' loop, and this is taking almost 2 hours to run... Can someone give me some light on how I can optimize the code?

The logic is as follows: For each 'Customer', I need to check if his 'FinalKey' exists in the 'Key' column. If it exists, the 'Final Name' of this customer will be the same as the most repeated name in 'Customer' for that same 'FinalKey' in 'Key'. Here is an example below:


Creates the dataframe df

data = [['Tom','123', '123'], ['Tom', '54', '123'], \
    ['Tom', '21', '123'], ['Tom2', '123', '123'], \
    ['Tom3', '123', '123'], ['Tom3', '123', '123'], \
    ['John', '45', '45'], ['Mary', '23', '41']]

df = pd.DataFrame(data, columns=['Customer', 'Key', 'FinalKey'])
df['Final Name']=''

Print dataframe

df


Customer Key FinalKey Final Name
Tom 123 123
Tom 54 123
Tom 21 123
Tom2 123 123
Tom3 123 123
Tom3 123 123
John 45 45
Mary 41 41

Here is the logic:


Logic

for i in range(0, len(df['Customer'])):  

if str(df.loc[i, 'FinalKey']) in list(df['Key']):    
    df.loc[i, 'Final Name'] = df[df['Key']==df.loc[i, 'FinalKey']]['Customer'].value_counts().idxmax()

else:
    df.loc[i, 'Final Name'] = ""

df



| Customer |   Key   | FinalKey | Final Name | 
| -------- | ------- | -------- | ---------- |
|   Tom    |   123   |    123   |    Tom3    |
|   Tom    |    54   |    123   |    Tom3    |
|   Tom    |    21   |    123   |    Tom3    |
|   Tom2   |   123   |    123   |    Tom3    |
|   Tom3   |   123   |    123   |    Tom3    |
|   Tom3   |   123   |    123   |    Tom3    |
|   John   |    45   |     45   |    John    |
|   Mary   |    23   |     41   |            |

CodePudding user response:

I'm sure there's a better way, but this works:

top_cust_by_key = df[['Key', 'Customer']].groupby('Key').agg(lambda x: x.value_counts().index[0])['Customer']
df['Final Name'] = df['FinalKey'].map(top_cust_by_key)
  • Related