Home > Net >  Replace column value of Dataframe based on a condition on another Dataframe
Replace column value of Dataframe based on a condition on another Dataframe

Time:03-29

I have two dataframes where I need to update the first one based on the value of the second one if exists. Sample story provided below is to replace the student_id with updatedId if exists in 'old_id' column and replace it with 'new_id'.

import pandas as pd
import numpy as np

student = {
    'Name': ['John', 'Jay', 'sachin', 'Geetha', 'Amutha', 'ganesh'],
    'gender': ['male', 'male', 'male', 'female', 'female', 'male'],
    'math score': [50, 100, 70, 80, 75, 40],
    'student_Id': ['1234', '6788', 'xyz', 'abcd', 'ok83', '234v'],
}

updatedId = {
    'old_id' : ['ok83', '234v'],
    'new_id' : ['83ko', 'v432'],
}

df_student = pd.DataFrame(student)
df_updated_id = pd.DataFrame(updatedId)

print(df_student)
print(df_updated_id)

# Method with np.where
for index, row in df_updated_id.iterrows():
    df_student['student_Id'] = np.where(df_student['student_Id'] == row['old_id'], row['new_id'],  df_student['student_Id'])
    
# print(df_student)

# Method with dataframe.mask
for index, row in df_updated_id.iterrows():
   df_student['student_Id'].mask(df_student['student_Id'] == row['old_id'],  row['new_id'], inplace=True)

print(df_student)

The results for both methods above work and yield the correct result

     Name  gender  math score student_Id
0    John    male          50       1234
1     Jay    male         100       6788
2  sachin    male          70        xyz
3  Geetha  female          80       abcd
4  Amutha  female          75       ok83
5  ganesh    male          40       234v

  old_id new_id
0   ok83   83ko
1   234v   v432

     Name  gender  math score student_Id
0    John    male          50       1234
1     Jay    male         100       6788
2  sachin    male          70        xyz
3  Geetha  female          80       abcd
4  Amutha  female          75       83ko
5  ganesh    male          40       v432

Nonetheless, the actual data of students has about 500,000 rows and updated_id has 6000 rows.

Thus I run into performance issues as loop is very slow:

A simple timer are placed to observe the number of records processed for df_updated_id

100 rows - numpy Time=3.9020769596099854; mask Time=3.9169061183929443

500 rows - numpy Time=20.42293930053711; mask Time=19.768696784973145

1000 rows - numpy Time=40.06309795379639; mask Time=37.26559829711914

My question is whether I can optimize it using a merge (join table), or ditch the iterrows? I tried something like the below but failed to get it to work. Replace dataframe column values based on matching id in another dataframe, and How to iterate over rows in a DataFrame in Pandas

Please advice..

CodePudding user response:

We can just replace

df_student.replace({'student_Id':df_updated_id.set_index('old_id')['new_id']},inplace=True)
df_student
Out[337]: 
     Name  gender  math score student_Id
0    John    male          50       1234
1     Jay    male         100       6788
2  sachin    male          70        xyz
3  Geetha  female          80       abcd
4  Amutha  female          75       83ko
5  ganesh    male          40       v432

CodePudding user response:

You can also try with map:

df_student['student_Id'] = (
    df_student['student_Id'].map(df_updated_id.set_index('old_id')['new_id'])
                            .fillna(df_student['student_Id'])
)
print(df_student)

# Output
     Name  gender  math score student_Id
0    John    male          50       1234
1     Jay    male         100       6788
2  sachin    male          70        xyz
3  Geetha  female          80       abcd
4  Amutha  female          75       83ko
5  ganesh    male          40       v432

CodePudding user response:

Also, try replace with dictionary comprehension:

df_student.replace({'student_Id':{o:n for o, n in zip(updatedId['old_id'], 
                                                      updatedId['new_id'])}})

Output:

     Name  gender  math score student_Id
0    John    male          50       1234
1     Jay    male         100       6788
2  sachin    male          70        xyz
3  Geetha  female          80       abcd
4  Amutha  female          75       83ko
5  ganesh    male          40       v432
  • Related