I have 3 dataframes. I merge df1
and df2
through a common column. However, I need to use df3
to find what values are allowed for pairs seen in groupby created. I could get this part done too using 2-column merge through inner join, but I also need to se the entries that did not have any common elements. So far what I could do is represented with a model problem here:
ch = {'country':['India','India','India','USA','USA','Italy','Italy'],'hotel':['Taj','Oberoi','Hilton','Taj','Hilton','Oberoi','Marriott']}
ch_df = pd.DataFrame.from_dict(ch)
hm = {'hotel':['Taj','Taj','Taj','Oberoi','Oberoi','Marriott','Marriott','Marriott','Hilton','Hilton'],'menu':['ildi','dosa','soup','soup','ildi','soup','pasta','pizza','pizza','burger']}
hm_df = pd.DataFrame.from_dict(hm)
cm = {'country':['India','India','India','USA','USA','USA','Italy','Italy'],'menu':['ildi','dosa','soup','dosa','burger','pizza','pizza','pasta']}
cm_df = pd.DataFrame.from_dict(cm)
chm_df = pd.merge(ch_df, hm_df, left_on='hotel', right_on='hotel')
pd.merge(left=chm_df, right=cm_df, on=['country','menu'], how='inner').groupby(['country','hotel'])['menu'].apply(list).reset_index(name='menu items')
country hotel menu items
0 India Oberoi [ildi, soup]
1 India Taj [ildi, dosa, soup]
2 Italy Marriott [pasta, pizza]
3 USA Hilton [pizza, burger]
4 USA Taj [dosa]
What I need are entries such as:
5 Italy Oberoi []
...
One inefficient way is to add to each pair in hm_df
an allowed menu item and remove it after groupby. But it looks ugly. What is a more elegant method?
CodePudding user response:
If need all possible combinations is possible use DataFrame.unstack
withDataFrame.stack
, for replace non exist values to empty lists use fill_value=[]
parameter:
df = pd.merge(chm_df, cm_df, on=['country','menu']).groupby(['country','hotel'])['menu'].apply(list).unstack(fill_value=[]).stack().reset_index(name='menu items')
print (df)
country hotel menu items
0 India Hilton []
1 India Marriott []
2 India Oberoi [ildi, soup]
3 India Taj [ildi, dosa, soup]
4 Italy Hilton []
5 Italy Marriott [pasta, pizza]
6 Italy Oberoi []
7 Italy Taj []
8 USA Hilton [pizza, burger]
9 USA Marriott []
10 USA Oberoi []
11 USA Taj [dosa]
For completness if need only non exist values from chm_df
convert to empty lists:
df = pd.merge(chm_df, cm_df, on=['country','menu']).groupby(['country','hotel'])['menu'].apply(list).reindex(pd.MultiIndex.from_frame(ch_df), fill_value=[]).reset_index(name='menu items')
print (df)
country hotel menu items
0 India Taj [ildi, dosa, soup]
1 India Oberoi [ildi, soup]
2 India Hilton []
3 USA Taj [dosa]
4 USA Hilton [pizza, burger]
5 Italy Oberoi []
6 Italy Marriott [pasta, pizza]
CodePudding user response:
The tables I was working with were too large (~100M) so the approach I initially sought does not work. So I have now made, hopefully, a more better suited approach. Hope someone may find it interesting.
import pandas as pd
ch = {'country':['India','India','India','USA','USA','Italy','Italy'],'hotel':['Taj','Oberoi','Hilton','Taj','Hilton','Oberoi','Marriott']}
ch_df = pd.DataFrame.from_dict(ch)
hm = {'hotel':['Taj','Taj','Taj','Oberoi','Oberoi','Marriott','Marriott','Marriott','Hilton','Hilton'],'menu':['ildi','dosa','soup','soup','ildi','soup','pasta','pizza','pizza','burger']}
hm_df = pd.DataFrame.from_dict(hm)
cm = {'country':['India','India','India','USA','USA','USA','Italy','Italy'],'menu':['ildi','dosa','soup','dosa','burger','pizza','pizza','pasta']}
cm_df = pd.DataFrame.from_dict(cm)
c_dict = cm_df.groupby('country')['menu'].apply(set).to_dict()
h_dict = hm_df.groupby('hotel')['menu'].apply(set).to_dict()
# This won't work if there are missing keys in the dictionary column, as it was in my case
# ch_df['new_col'] = ch_df.apply(lambda x: '|'.join(c_dict[x.country].intersection(h_dict[x.hotel])), axis=1)
def menu_list(a,b):
if a in c_dict and b in h_dict:
return '|'.join(c_dict[a].intersection(h_dict[b]))
return ''
ch_df['menu_list'] = ch_df.apply(lambda x: menu_list(x.country,x.hotel), axis=1)
ch_df