Home > Software design >  Pandas: Retain column entries after inner join even if there are no common values
Pandas: Retain column entries after inner join even if there are no common values

Time:06-28

I have 3 dataframes. I merge df1 and df2 through a common column. However, I need to use df3 to find what values are allowed for pairs seen in groupby created. I could get this part done too using 2-column merge through inner join, but I also need to se the entries that did not have any common elements. So far what I could do is represented with a model problem here:

ch = {'country':['India','India','India','USA','USA','Italy','Italy'],'hotel':['Taj','Oberoi','Hilton','Taj','Hilton','Oberoi','Marriott']}
ch_df = pd.DataFrame.from_dict(ch)

hm = {'hotel':['Taj','Taj','Taj','Oberoi','Oberoi','Marriott','Marriott','Marriott','Hilton','Hilton'],'menu':['ildi','dosa','soup','soup','ildi','soup','pasta','pizza','pizza','burger']}
hm_df = pd.DataFrame.from_dict(hm)

cm = {'country':['India','India','India','USA','USA','USA','Italy','Italy'],'menu':['ildi','dosa','soup','dosa','burger','pizza','pizza','pasta']}
cm_df = pd.DataFrame.from_dict(cm)

chm_df = pd.merge(ch_df, hm_df, left_on='hotel', right_on='hotel')

pd.merge(left=chm_df, right=cm_df, on=['country','menu'], how='inner').groupby(['country','hotel'])['menu'].apply(list).reset_index(name='menu items')
  country     hotel          menu items
0   India    Oberoi        [ildi, soup]
1   India       Taj  [ildi, dosa, soup]
2   Italy  Marriott      [pasta, pizza]
3     USA    Hilton     [pizza, burger]
4     USA       Taj              [dosa]

What I need are entries such as:

5   Italy  Oberoi                    []
...

One inefficient way is to add to each pair in hm_df an allowed menu item and remove it after groupby. But it looks ugly. What is a more elegant method?

CodePudding user response:

If need all possible combinations is possible use DataFrame.unstack withDataFrame.stack, for replace non exist values to empty lists use fill_value=[] parameter:

df = pd.merge(chm_df, cm_df, on=['country','menu']).groupby(['country','hotel'])['menu'].apply(list).unstack(fill_value=[]).stack().reset_index(name='menu items')
    
print (df)
   country     hotel          menu items
0    India    Hilton                  []
1    India  Marriott                  []
2    India    Oberoi        [ildi, soup]
3    India       Taj  [ildi, dosa, soup]
4    Italy    Hilton                  []
5    Italy  Marriott      [pasta, pizza]
6    Italy    Oberoi                  []
7    Italy       Taj                  []
8      USA    Hilton     [pizza, burger]
9      USA  Marriott                  []
10     USA    Oberoi                  []
11     USA       Taj              [dosa]

For completness if need only non exist values from chm_df convert to empty lists:

df = pd.merge(chm_df, cm_df, on=['country','menu']).groupby(['country','hotel'])['menu'].apply(list).reindex(pd.MultiIndex.from_frame(ch_df), fill_value=[]).reset_index(name='menu items')
    
print (df)
  country     hotel          menu items
0   India       Taj  [ildi, dosa, soup]
1   India    Oberoi        [ildi, soup]
2   India    Hilton                  []
3     USA       Taj              [dosa]
4     USA    Hilton     [pizza, burger]
5   Italy    Oberoi                  []
6   Italy  Marriott      [pasta, pizza]

CodePudding user response:

The tables I was working with were too large (~100M) so the approach I initially sought does not work. So I have now made, hopefully, a more better suited approach. Hope someone may find it interesting.

import pandas as pd

ch = {'country':['India','India','India','USA','USA','Italy','Italy'],'hotel':['Taj','Oberoi','Hilton','Taj','Hilton','Oberoi','Marriott']}
ch_df = pd.DataFrame.from_dict(ch)

hm = {'hotel':['Taj','Taj','Taj','Oberoi','Oberoi','Marriott','Marriott','Marriott','Hilton','Hilton'],'menu':['ildi','dosa','soup','soup','ildi','soup','pasta','pizza','pizza','burger']}
hm_df = pd.DataFrame.from_dict(hm)

cm = {'country':['India','India','India','USA','USA','USA','Italy','Italy'],'menu':['ildi','dosa','soup','dosa','burger','pizza','pizza','pasta']}
cm_df = pd.DataFrame.from_dict(cm)

c_dict = cm_df.groupby('country')['menu'].apply(set).to_dict()
h_dict = hm_df.groupby('hotel')['menu'].apply(set).to_dict()

# This won't work if there are missing keys in the dictionary column, as it was in my case
# ch_df['new_col'] = ch_df.apply(lambda x: '|'.join(c_dict[x.country].intersection(h_dict[x.hotel])), axis=1)

def menu_list(a,b):
    if a in c_dict and b in h_dict:
        return '|'.join(c_dict[a].intersection(h_dict[b]))
    return ''

ch_df['menu_list'] = ch_df.apply(lambda x: menu_list(x.country,x.hotel), axis=1)

ch_df
  • Related