Optimize a for loop in Python

Time:11-09

I have two pandas DataFrames (dfA, dfB), each with two columns (gender, firstname). dfA holds the data to be cleaned (wrong first name / gender); the correct value is looked up in dfB. My code below works but is extremely slow on millions of rows. Is there a way to make it faster (without using a database or similar)? Thank you.

for rowIndex in range(len(dfA)):
    firstname = dfA.loc[rowIndex,'firstname']
    try:
        # Scans all of dfB on every iteration; this is the slow part.
        dfA.loc[rowIndex,'genderNew'] = dfB.loc[dfB['firstname'] == firstname].gender.values[0]
    except IndexError:
        # No row in dfB matches this first name.
        dfA.loc[rowIndex,'genderNew'] = "unknown"

CodePudding user response:

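A left merge does the lookup in one vectorized step and should be much more efficient than the row-by-row loop. Since both frames have a gender column, give the one from dfB a suffix, and drop duplicate first names in dfB so the merge does not add rows:

dfA = dfA.merge(dfB.drop_duplicates('firstname'), on='firstname', how='left', suffixes=('', 'New'))
dfA['genderNew'] = dfA['genderNew'].fillna('unknown')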
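If only the single genderNew column is needed, an alternative to merging is to build a lookup Series from dfB and map it over dfA. The following is a minimal sketch, not the answerer's method; the sample rows are made up for illustration, and it assumes that, as in the original loop, the first matching row in dfB wins when a first name appears more than once.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's dfA and dfB.
dfA = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "firstname": ["alice", "bob", "zed"],
})
dfB = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "firstname": ["alice", "bob", "bob"],
})

# Keep the first gender per first name (same choice as .values[0] in the loop)
# and index by first name so the Series acts as a lookup table.
lookup = dfB.drop_duplicates("firstname").set_index("firstname")["gender"]

# Vectorized lookup: names missing from dfB come back as NaN,
# which is then replaced by "unknown".
dfA["genderNew"] = dfA["firstname"].map(lookup).fillna("unknown")
```

Unlike the merge, this leaves dfA's existing columns and row order untouched and only adds genderNew.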