Vlookups in Pandas across 2 dataframe column-CodePudding

I have 2 dataframes and I wish to grab IDs matching with DF2 into df1 merged as separate columns. There are multiple columns to be added due to df2 having many different country names for a specific ID.

df1 looks like below:

ID   URL
A    example.com/1
B    example.com/2
C    example.com/3
D    example.com/4

df2 is like this:

ID   country    URL
A     usa       example.com/usa-A
B     uk        example.com/uk-B
C     canada    example.com/canada-C
A     uk        example.com/uk-A
C     usa       example.com/usa-C

What I am expecting df1 to look like:

ID   URL               uk                    USA                  Canada
A    example.com/1     example.com/uk-A      example.com/usa-A    NaN
B    example.com/2     example.com/uk-B      NaN                  NaN
C    example.com/3     NaN                   example.com/usa-C    example.com/canada-C
D    example.com/4     NaN                   NaN                  NaN

I wish to bring if DF1 ID A is found in DF2 ID against a country then bring the country URL up next to df1 ID in a specific country column.

The way I am trying to achieve this is using a for loop with a map call below:

final = pd.DataFrame()
for a in countries_list:
     b = df2.loc[(df2["country"] == a)]
     df1["country"] = df1['id'].map(df2.set_index('id')['url'])
     final = pd.concat([final, df1])

It runs for a certain amount of countries and then start throwing InvalidIndexError: Reindexing only valid with uniquely valued Index objects which I tried to overcome using a reset_index() function on both df1 and df2 but still after a certain amount of iterations, it throws me the same error.

Can someone suggest a more efficient way to do this or any way i could run it over all possible iterations?

Thanks,

CodePudding user response：

Try as follows:

res = df.merge(df2.pivot(index='ID',columns='country',values='URL'), 
               left_on='ID', right_index=True, how='left')

print(res)

  ID            URL                canada                uk                usa
0  A  example.com/1                   NaN  example.com/uk-A  example.com/usa-A
1  B  example.com/2                   NaN  example.com/uk-B                NaN
2  C  example.com/3  example.com/canada-C               NaN  example.com/usa-C
3  D  example.com/4                   NaN               NaN                NaN

Explanation

First, use df.pivot on df2. We get:

print(df2.pivot(index='ID',columns='country',values='URL'))

country                canada                uk                usa
ID                                                                
A                         NaN  example.com/uk-A  example.com/usa-A
B                         NaN  example.com/uk-B                NaN
C        example.com/canada-C               NaN  example.com/usa-C

Next, use df.merge to merge df and pivoted df2, joining the two on ID, and passing left to the how parameter to "use only keys from left frame". (Leave how='left' out, if you are not interested in the row for ID with only NaN values.)

If you're set on a particular column order, use e.g.:

res = res.loc[:,['ID','URL','uk','usa','canada']]

CodePudding user response：

# map df to df2 for URL
df2['URL2']=df2['ID'].map(df.set_index(['ID'])['URL'])

#pivot
(df2.pivot(index=['ID','URL2'], columns='country', values='URL')
 .reset_index()
 .rename_axis(columns=None)
 .rename(columns={'URL2':'URL'}))


    ID  URL2    canada  uk  usa
0   A   example.com/1   NaN     example.com/uk-A    example.com/usa-A
1   B   example.com/2   NaN     example.com/uk-B    NaN
2   C   example.com/3   example.com/canada-C    NaN     example.com/usa-C