I have 2 dataframes and I wish to grab IDs matching with DF2 into df1 merged as separate columns. There are multiple columns to be added due to df2 having many different country names for a specific ID.
df1 looks like below:
ID URL
A example.com/1
B example.com/2
C example.com/3
D example.com/4
df2 is like this:
ID country URL
A usa example.com/usa-A
B uk example.com/uk-B
C canada example.com/canada-C
A uk example.com/uk-A
C usa example.com/usa-C
What I am expecting df1 to look like:
ID URL uk USA Canada
A example.com/1 example.com/uk-A example.com/usa-A NaN
B example.com/2 example.com/uk-B NaN NaN
C example.com/3 NaN example.com/usa-C example.com/canada-C
D example.com/4 NaN NaN NaN
I wish to bring if DF1 ID A is found in DF2 ID against a country then bring the country URL up next to df1 ID in a specific country column.
The way I am trying to achieve this is using a for loop with a map call below:
final = pd.DataFrame()
for a in countries_list:
b = df2.loc[(df2["country"] == a)]
df1["country"] = df1['id'].map(df2.set_index('id')['url'])
final = pd.concat([final, df1])
It runs for a certain amount of countries and then start throwing InvalidIndexError: Reindexing only valid with uniquely valued Index objects
which I tried to overcome using a reset_index() function on both df1 and df2 but still after a certain amount of iterations, it throws me the same error.
Can someone suggest a more efficient way to do this or any way i could run it over all possible iterations?
Thanks,
CodePudding user response:
Try as follows:
res = df.merge(df2.pivot(index='ID',columns='country',values='URL'),
left_on='ID', right_index=True, how='left')
print(res)
ID URL canada uk usa
0 A example.com/1 NaN example.com/uk-A example.com/usa-A
1 B example.com/2 NaN example.com/uk-B NaN
2 C example.com/3 example.com/canada-C NaN example.com/usa-C
3 D example.com/4 NaN NaN NaN
Explanation
- First, use
df.pivot
ondf2
. We get:
print(df2.pivot(index='ID',columns='country',values='URL'))
country canada uk usa
ID
A NaN example.com/uk-A example.com/usa-A
B NaN example.com/uk-B NaN
C example.com/canada-C NaN example.com/usa-C
- Next, use
df.merge
to mergedf
and pivoteddf2
, joining the two onID
, and passingleft
to thehow
parameter to "use only keys from left frame". (Leavehow='left'
out, if you are not interested in the row forID
with onlyNaN
values.)
If you're set on a particular column order, use e.g.:
res = res.loc[:,['ID','URL','uk','usa','canada']]
CodePudding user response:
# map df to df2 for URL
df2['URL2']=df2['ID'].map(df.set_index(['ID'])['URL'])
#pivot
(df2.pivot(index=['ID','URL2'], columns='country', values='URL')
.reset_index()
.rename_axis(columns=None)
.rename(columns={'URL2':'URL'}))
ID URL2 canada uk usa
0 A example.com/1 NaN example.com/uk-A example.com/usa-A
1 B example.com/2 NaN example.com/uk-B NaN
2 C example.com/3 example.com/canada-C NaN example.com/usa-C