optimizing pandas dataframe mapping method-CodePudding

I've got a pandas df where I've already matched the name to the ID, but there are some IDs that don't have a name. For those, I want to go back to the mapping file and search the 'alternative_ID_list' column and see if there is a match with a corresponding name (ignore the decimal). I also tried something (check very bottom) but it is taking way to long to run.

current df

    name     ID
0    joe  USER1
3   mary  USER2
5  USER3  USER3
8  USER4  USER4
9  USER5  USER5
9  USER6  USER6   

bad_matches=[3, 4, 5, 6]

mapping_df

    name       ID   alternative_ID_list
0    joe    USER1            USER213.32
3   mary    USER2            USER643.11
5    sam   USER98              USER31.5
7   jack  USER992               USER4.2
8   rick  USER902    USER5.6, USER321.1
9   john  USER979    USER6.8, USER987.9
10   jay  USER980  USER479.2, USER989.0

#use mapping_df to find the bad_match_IDs (take the first match found if multiple rows for one bad_match_id)

desired 

    name     ID
0    joe  USER1
3   mary  USER2
5  USER3  USER3
7   jack  USER4
8   rick  USER5
9   john  USER6

what i've tried (takes way too long): (df2=current_df)

df2=current_df
bad_matches = df2['ID'].loc[df2['name'].isna()]

mapping_df.alternative_ID_list = mapping_df. alternative_ID_list.str.split(',')
mapping_df = mapping_df.explode('alternative_ID_list')

mapping_df. alternative_ID_list = mapping_df. alternative_ID_list.astype(str).str.split('.').str[0].astype(str)

df3 = mapping_df.loc[lambda x: x['alternative_ID_list'].isin(bad_matches)]

df2['name'] = df2['ID'].map(df3.set_index('alternative_ID_list')['name']).fillna(df2['name'])

CodePudding user response：

# Let's clean up a copy of `mapping_df`
alt_df = (mapping_df.drop('ID', axis=1)
                    .rename(columns={'alternative_ID_list':'ID'}))
alt_df.ID = alt_df.ID.str.split(', ')
alt_df = alt_df.explode('ID')
alt_df.ID = alt_df.ID.str.replace('\.\d ', '', regex=True)

cols = ['ID', 'name']
mask = df.name.eq(df.ID)

df.loc[mask, cols] = (df.loc[mask, ['ID']] # Let's only look at relevant columns
                        # df.merge is the best way to merge things by specific or like columns. 
                        # `left` means we keep even non-matches from df
                        .merge(alt_df, on='ID', how='left')
                        .ffill(axis=1) # Fill `USER#` if we couldn't find a match.
                        .values) # Use just the values, so it doesn't try to use indices.
print(df)

# Output:

    name     ID
0    joe  USER1
3   mary  USER2
5  USER3  USER3
8   jack  USER4
9   rick  USER5
9   john  USER6

So, if all you had was df:

      ID
0  USER1
1  USER2
2  USER3
3  USER4
4  USER5
5  USER6

We could get the names using:

df = df.merge(mapping_df[['ID', 'name']], how='left')
mask = df.name.isna()
df.loc[mask] = df.loc[mask, ['ID']].merge(alt_df, on='ID', how='left').values
print(df)

# Output:

      ID  name
0  USER1   joe
1  USER2  mary
2  USER3   NaN
3  USER4  jack
4  USER5  rick
5  USER6  john

CodePudding user response：

Main dataframe current_df

print(current_df)
###
    name     ID
0    joe  USER1
1   mary  USER2
2  USER3  USER3
3  USER4  USER4
4  USER5  USER5
5  USER6  USER6

Prepare reference dataframe mdf

mapping_df.drop(columns=['ID'], inplace=True)
dd = pd.DataFrame(mapping_df["alternative_ID_list"].str.split(",").tolist()).fillna(np.nan)

for i in range(dd.shape[1]):
    mapping_df['alternative_ID_list_' str(i)] = dd[i].str.split(".").str[0]

mapping_df.drop(columns='alternative_ID_list', inplace=True)
mdf = mapping_df.melt(id_vars=['name'],
                      value_vars=mapping_df.columns[1:],
                      var_name='alternative_ID_list',
                      value_name='alternative_ID')
mdf.drop(columns=['alternative_ID_list'], inplace=True)
mdf.dropna(inplace=True)
print(mdf)
###
    name alternative_ID
0    joe        USER213
1   mary        USER643
2    sam         USER31
3   jack          USER4
4   rick          USER5
5   john          USER6
6    jay        USER479
11  rick        USER321
12  john        USER987
13   jay        USER989

processing

current_df_t = current_df.merge(mdf, left_on='ID', right_on='alternative_ID', how='left')
current_df_t.loc[current_df_t['name_y'].notna(), 'name_x'] = current_df_t['name_y']
current_df.loc[:,['name','ID']] = current_df_t.loc[:,['name_x','ID']].values
print(current_df)
###
    name     ID
0    joe  USER1
1   mary  USER2
2  USER3  USER3
3   jack  USER4
4   rick  USER5
5   john  USER6