I've got a pandas df where I've already matched the name to the ID, but some IDs don't have a name. For those, I want to go back to the mapping file, search the 'alternative_ID_list' column (ignoring the decimal part of the alternative IDs), and pull in the corresponding name if there's a match. I also tried something (see the very bottom), but it takes way too long to run.
current df
name ID
0 joe USER1
3 mary USER2
5 USER3 USER3
8 USER4 USER4
9 USER5 USER5
9 USER6 USER6
bad_matches=[3, 4, 5, 6]
mapping_df
name ID alternative_ID_list
0 joe USER1 USER213.32
3 mary USER2 USER643.11
5 sam USER98 USER31.5
7 jack USER992 USER4.2
8 rick USER902 USER5.6, USER321.1
9 john USER979 USER6.8, USER987.9
10 jay USER980 USER479.2, USER989.0
# use mapping_df to look up the bad-match IDs (take the first match found if there are multiple rows for one bad-match ID)
desired
name ID
0 joe USER1
3 mary USER2
5 USER3 USER3
7 jack USER4
8 rick USER5
9 john USER6
What I've tried (takes way too long):
df2 = current_df
bad_matches = df2['ID'].loc[df2['name'].isna()]
mapping_df.alternative_ID_list = mapping_df.alternative_ID_list.str.split(', ')
mapping_df = mapping_df.explode('alternative_ID_list')
mapping_df.alternative_ID_list = mapping_df.alternative_ID_list.str.split('.').str[0]
df3 = mapping_df.loc[lambda x: x['alternative_ID_list'].isin(bad_matches)]
df2['name'] = df2['ID'].map(df3.set_index('alternative_ID_list')['name']).fillna(df2['name'])
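(As an aside, the same idea can be condensed into a single lookup Series built once from the untouched mapping_df and applied with .map. This is only a sketch: lookup and bad are names made up here, it assumes, as in the sample df above, that the unmatched rows are the ones where name equals ID, and drop_duplicates keeps the first match per the comment above.)
# build the lookup once, from the original mapping_df: alternative ID (decimal stripped) -> name, first match wins
lookup = (mapping_df.assign(alt=mapping_df['alternative_ID_list'].str.split(', '))
                    .explode('alt')
                    .assign(alt=lambda d: d['alt'].str.split('.').str[0])
                    .drop_duplicates('alt')
                    .set_index('alt')['name'])
bad = df2['name'].eq(df2['ID'])            # rows that still have no real name
df2.loc[bad, 'name'] = df2.loc[bad, 'ID'].map(lookup).fillna(df2.loc[bad, 'ID'])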
CodePudding user response:
# Let's clean up a copy of `mapping_df`
alt_df = (mapping_df.drop('ID', axis=1)
                    .rename(columns={'alternative_ID_list': 'ID'}))
alt_df.ID = alt_df.ID.str.split(', ')
alt_df = alt_df.explode('ID')
alt_df.ID = alt_df.ID.str.replace(r'\.\d+', '', regex=True)  # drop the decimal suffix
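At this point alt_df is the long-format lookup we merge against; for reference, it should look roughly like this (indices carried over from mapping_df by explode):
print(alt_df)
# Output:
name ID
0 joe USER213
3 mary USER643
5 sam USER31
7 jack USER4
8 rick USER5
8 rick USER321
9 john USER6
9 john USER987
10 jay USER479
10 jay USER989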
cols = ['ID', 'name']
mask = df.name.eq(df.ID)
df.loc[mask, cols] = (df.loc[mask, ['ID']]                   # Let's only look at the relevant columns
                        # merge on the shared 'ID' column; `how='left'` keeps even non-matches from df
                        .merge(alt_df, on='ID', how='left')
                        .ffill(axis=1)                       # Fill with `USER#` if we couldn't find a match.
                        .values)                             # Use just the values, so pandas doesn't try to align indices.
print(df)
# Output:
name ID
0 joe USER1
3 mary USER2
5 USER3 USER3
8 jack USER4
9 rick USER5
9 john USER6
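The .ffill(axis=1) step is what provides the fallback: after the left merge, a row with no match has NaN in name, and filling across the row copies the ID (the column to its left) into it. A tiny standalone illustration with a toy frame, not data from the question:
import pandas as pd
tmp = pd.DataFrame({'ID': ['USER3', 'USER4'], 'name': [None, 'jack']})
print(tmp.ffill(axis=1))
# Output:
ID name
0 USER3 USER3
1 USER4 jack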
So, if all you had was df:
ID
0 USER1
1 USER2
2 USER3
3 USER4
4 USER5
5 USER6
We could get the names using:
df = df.merge(mapping_df[['ID', 'name']], how='left')
mask = df.name.isna()
df.loc[mask] = df.loc[mask, ['ID']].merge(alt_df, on='ID', how='left').values
print(df)
# Output:
ID name
0 USER1 joe
1 USER2 mary
2 USER3 NaN
3 USER4 jack
4 USER5 rick
5 USER6 john
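If you also wanted USER3 to fall back to its own ID here (as in the desired output above), one more line would do it, assuming df is the merged frame from this example:
df['name'] = df['name'].fillna(df['ID'])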
CodePudding user response:
Main dataframe current_df
print(current_df)
###
name ID
0 joe USER1
1 mary USER2
2 USER3 USER3
3 USER4 USER4
4 USER5 USER5
5 USER6 USER6
Prepare reference dataframe mdf
import numpy as np
import pandas as pd

mapping_df.drop(columns=['ID'], inplace=True)
dd = pd.DataFrame(mapping_df["alternative_ID_list"].str.split(", ").tolist()).fillna(np.nan)
for i in range(dd.shape[1]):
    mapping_df['alternative_ID_list_' + str(i)] = dd[i].str.split(".").str[0]
mapping_df.drop(columns='alternative_ID_list', inplace=True)
mdf = mapping_df.melt(id_vars=['name'],
                      value_vars=mapping_df.columns[1:],
                      var_name='alternative_ID_list',
                      value_name='alternative_ID')
mdf.drop(columns=['alternative_ID_list'], inplace=True)
mdf.dropna(inplace=True)
print(mdf)
###
name alternative_ID
0 joe USER213
1 mary USER643
2 sam USER31
3 jack USER4
4 rick USER5
5 john USER6
6 jay USER479
11 rick USER321
12 john USER987
13 jay USER989
Processing
current_df_t = current_df.merge(mdf, left_on='ID', right_on='alternative_ID', how='left')
current_df_t.loc[current_df_t['name_y'].notna(), 'name_x'] = current_df_t['name_y']
current_df.loc[:,['name','ID']] = current_df_t.loc[:,['name_x','ID']].values
print(current_df)
###
name ID
0 joe USER1
1 mary USER2
2 USER3 USER3
3 jack USER4
4 rick USER5
5 john USER6
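A quick way to double-check the result, assuming (as in the sample data) that an unmatched row is one whose name still equals its ID; leftover is just a name made up here:
leftover = current_df.loc[current_df['name'] == current_df['ID']]
print(leftover)
###
name ID
2 USER3 USER3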