I have two pandas DataFrames (dfA, dfB) with two columns each (gender, firstname). dfA holds the data to be cleaned (wrong first name / gender); the correct value is looked up in dfB. My code below works but is extremely slow for millions of rows. Is there a faster way to do this (without using a database or similar)? Thank you.
for rowIndex in range(len(dfA)):
    firstname = dfA.loc[rowIndex, 'firstname']
    try:
        # take the first gender listed in dfB for this first name
        dfA.loc[rowIndex, 'genderNew'] = dfB.loc[dfB['firstname'] == firstname].gender.values[0]
    except Exception:
        # no matching first name in dfB
        dfA.loc[rowIndex, 'genderNew'] = "unknown"
CodePudding user response:
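A left merge on firstname does the lookup in a single vectorized join instead of scanning dfB once per row. Rename dfB's gender column first so it does not collide with dfA's, drop duplicate first names so the merge cannot add rows, and fill only the new column:
dfA = dfA.merge(dfB.drop_duplicates('firstname').rename(columns={'gender': 'genderNew'}), on='firstname', how='left')
dfA['genderNew'] = dfA['genderNew'].fillna('unknown')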
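As an alternative, here is a minimal sketch of the same lookup written with Series.map, which leaves dfA's other columns untouched. The sample data and the name `lookup` are made up for illustration; only the column names come from the question. It assumes that when a first name appears more than once in dfB, the first occurrence wins, as it does with `.values[0]` in the original loop.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
dfA = pd.DataFrame({"firstname": ["anna", "bob", "zed"],
                    "gender": ["m", "m", "f"]})
dfB = pd.DataFrame({"firstname": ["anna", "bob", "anna"],
                    "gender": ["f", "m", "x"]})

# Build a firstname -> gender lookup, keeping the first gender listed
# for each first name (the loop's .values[0] did the same).
lookup = dfB.drop_duplicates("firstname").set_index("firstname")["gender"]

# Vectorized lookup: names missing from dfB map to NaN, then to "unknown".
dfA["genderNew"] = dfA["firstname"].map(lookup).fillna("unknown")
```

Because map never changes the number of rows, this version cannot duplicate rows of dfA even if dfB contains repeated first names.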