I have two dataframes as:
df_A:
{'last_name': {0: 'Williams', 1: 'Henry', 2: 'XYX', 3: 'Smith', 4: 'David', 5: 'Freeman', 6: 'Walter', 7: 'Test_A', 8: 'Mallesham', 9: 'Mallesham', 10: 'Henry', 11: 'Smith'}, 'first_name': {0: 'Henry', 1: 'Williams', 2: 'ABC', 3: 'David', 4: 'Smith', 5: 'Walter', 6: 'Freeman', 7: 'Test_B', 8: 'Yamulla', 9: 'Yamulla', 10: 'Williams', 11: 'David'}, 'full_name': {0: 'Williams Henry', 1: 'Henry Williams', 2: 'XYX ABC', 3: 'Smith David', 4: 'David Smith', 5: 'Freeman Walter', 6: 'Walter Freeman', 7: 'Test_A Test_B', 8: 'Mallesham Yamulla', 9: 'Mallesham Yamulla', 10: 'Henry Williams', 11: 'Smith David'}, 'name_unique_identifier': {0: 'NAME_GROUP-11', 1: 'NAME_GROUP-11', 2: 'NAME_GROUP-12', 3: 'NAME_GROUP-13', 4: 'NAME_GROUP-13', 5: 'NAME_GROUP-14', 6: 'NAME_GROUP-14', 7: 'NAME_GROUP-15', 8: 'NAME_GROUP-16', 9: 'NAME_GROUP-16', 10: 'NAME_GROUP-11', 11: 'NAME_GROUP-13'}}
last_name first_name full_name name_unique_identifier
0 Williams Henry Williams Henry NAME_GROUP-11
1 Henry Williams Henry Williams NAME_GROUP-11
2 XYX ABC XYX ABC NAME_GROUP-12
3 Smith David Smith David NAME_GROUP-13
4 David Smith David Smith NAME_GROUP-13
5 Freeman Walter Freeman Walter NAME_GROUP-14
6 Walter Freeman Walter Freeman NAME_GROUP-14
7 Test_A Test_B Test_A Test_B NAME_GROUP-15
8 Mallesham Yamulla Mallesham Yamulla NAME_GROUP-16
9 Mallesham Yamulla Mallesham Yamulla NAME_GROUP-16
10 Henry Williams Henry Williams NAME_GROUP-11
11 Smith David Smith David NAME_GROUP-13
df_B:
{'name_unique_identifier': {0: 'NAME_GROUP-11', 1: 'NAME_GROUP-13', 2: 'NAME_GROUP-14'}, 'full_name': {0: 'Henry Williams', 1: 'Smith David', 2: 'Freeman Walter'}, 'last_name': {0: 'Henry', 1: 'Smith', 2: 'Freeman'}, 'first_name': {0: 'Williams', 1: 'David', 2: 'Walter'}}
name_unique_identifier full_name last_name first_name
0 NAME_GROUP-11 Henry Williams Henry Williams
1 NAME_GROUP-13 Smith David Smith David
2 NAME_GROUP-14 Freeman Walter Freeman Walter
Here wherever the name_unique_identifier
exists in df_A
and df_B
, df_A
dataframe column's last_name,first_name
to be filled in with df_B
last_name,first_name
, the non matched entries not required to be updated.
Example:
NAME_GROUP-14
exists in df_A
and df_B
. So last_name
and first_name
in df_A
for this identifier should be as 'Freeman','Walter'.
As I'm dealing with millions of records, an efficient technique is needed.
CodePudding user response:
You can check each unique value in column=name_unique_identifier
from df_B
where exist in df_A
and then insert the value from df_B
to df_A
.
col = 'name_unique_identifier'
for val in df_B[col]:
msk_A = df_A[col].eq(val)
msk_B = df_B[col].eq(val)
df_A.loc[msk, ['last_name', 'first_name']] = df_B.loc[msk_B, ['last_name', 'first_name']].values
# If you want to update 'full_name' base new values of 'last_name' and 'first_name'
df_A['full_name'] = df_A['last_name'] " " df_A['first_name']
print(df_A)
last_name first_name full_name name_unique_identifier
0 Williams Henry Williams Henry NAME_GROUP-11
1 Henry Williams Henry Williams NAME_GROUP-11
2 XYX ABC XYX ABC NAME_GROUP-12
3 Smith David Smith David NAME_GROUP-13
4 David Smith David Smith NAME_GROUP-13
5 Freeman Walter Freeman Walter NAME_GROUP-14
6 Freeman Walter Freeman Walter NAME_GROUP-14
7 Test_A Test_B Test_A Test_B NAME_GROUP-15
8 Mallesham Yamulla Mallesham Yamulla NAME_GROUP-16
9 Mallesham Yamulla Mallesham Yamulla NAME_GROUP-16
10 Henry Williams Henry Williams NAME_GROUP-11
11 Smith David Smith David NAME_GROUP-13
CodePudding user response:
This pandas solution might work for you:
df_A = pd.DataFrame({'last_name': {0: 'Williams', 1: 'Henry', 2: 'XYX', 3: 'Smith', 4: 'David', 5: 'Freeman', 6: 'Walter', 7: 'Test_A', 8: 'Mallesham', 9: 'Mallesham', 10: 'Henry', 11: 'Smith'}, 'first_name': {0: 'Henry', 1: 'Williams', 2: 'ABC', 3: 'David', 4: 'Smith', 5: 'Walter', 6: 'Freeman', 7: 'Test_B', 8: 'Yamulla', 9: 'Yamulla', 10: 'Williams', 11: 'David'}, 'full_name': {0: 'Williams Henry', 1: 'Henry Williams', 2: 'XYX ABC', 3: 'Smith David', 4: 'David Smith', 5: 'Freeman Walter', 6: 'Walter Freeman', 7: 'Test_A Test_B', 8: 'Mallesham Yamulla', 9: 'Mallesham Yamulla', 10: 'Henry Williams', 11: 'Smith David'}, 'name_unique_identifier': {0: 'NAME_GROUP-11', 1: 'NAME_GROUP-11', 2: 'NAME_GROUP-12', 3: 'NAME_GROUP-13', 4: 'NAME_GROUP-13', 5: 'NAME_GROUP-14', 6: 'NAME_GROUP-14', 7: 'NAME_GROUP-15', 8: 'NAME_GROUP-16', 9: 'NAME_GROUP-16', 10: 'NAME_GROUP-11', 11: 'NAME_GROUP-13'}})
df_B = pd.DataFrame({'name_unique_identifier': {0: 'NAME_GROUP-11', 1: 'NAME_GROUP-13', 2: 'NAME_GROUP-14'}, 'full_name': {0: 'Henry Williams', 1: 'Smith David', 2: 'Freeman Walter'}, 'last_name': {0: 'Henry', 1: 'Smith', 2: 'Freeman'}, 'first_name': {0: 'Williams', 1: 'David', 2: 'Walter'}})
df_A.update(pd.merge(df_A, df_B, how='left', on='name_unique_identifier').drop(['last_name_x', 'first_name_x', 'full_name_x'], axis=1).rename(columns={'full_name_y':'full_name', 'last_name_y':'last_name', 'first_name_y':'first_name'}))
print(df_A)
Result:
last_name first_name full_name name_unique_identifier
0 Henry Williams Henry Williams NAME_GROUP-11
1 Henry Williams Henry Williams NAME_GROUP-11
2 XYX ABC XYX ABC NAME_GROUP-12
3 Smith David Smith David NAME_GROUP-13
4 Smith David Smith David NAME_GROUP-13
5 Freeman Walter Freeman Walter NAME_GROUP-14
6 Freeman Walter Freeman Walter NAME_GROUP-14
7 Test_A Test_B Test_A Test_B NAME_GROUP-15
8 Mallesham Yamulla Mallesham Yamulla NAME_GROUP-16
9 Mallesham Yamulla Mallesham Yamulla NAME_GROUP-16
10 Henry Williams Henry Williams NAME_GROUP-11
11 Smith David Smith David NAME_GROUP-13