How to aggregate between two dataframes in pandas-CodePudding

Good evening.

I have two dataframes, each having several columns and a couple of million lines but right now we're interested only in two of columns: FOOs and IDs for DataFrame1, BARs and IDs for DataFrame2. IDs is basically a many-to-many relationship and it goes like this:

import pandas as pd

data1=['foo1', 'foo2', 'foo1', 'foo1', 'foo3', 'foo2', 'foo3','foo4',
      'foo1', 'foo3', 'foo1', 'foo2', 'foo1', 'foo2', 'foo3']
id1=[1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 7, 7]

data2=['bar1', 'bar1', 'bar1', 'bar2', 'bar3', 'bar1', 'bar1',
      'bar3', 'bar2', 'bar1' ]
id2=[1, 1, 1, 1, 2, 3, 4, 5, 6, 7]

df_foo=pd.DataFrame(data=zip(data1, id1), columns=['FOOs', 'IDs'],
                    dtype='object')
df_bar=pd.DataFrame(data=zip(data2, id2), columns=['BARs', 'IDs'],
                    dtype='object')

What I need to do is to aggregate all the FOOs with the BARs. I have a solution that works but it does look messy:

def my_agg(series):
    return df_bar[df_bar.IDs.isin(series)].groupby('BARs').agg({'BARs': pd.unique})

print(df_foo.groupby('FOOs').agg({'FOOs': pd.unique, 'IDs': my_agg}).values)

And the output is:

[[array(['foo1'], dtype=object)
  array([[array(['bar1'], dtype=object)],
         [array(['bar2'], dtype=object)],
         [array(['bar3'], dtype=object)]], dtype=object)]
 [array(['foo2'], dtype=object)
  array([[array(['bar1'], dtype=object)],
         [array(['bar2'], dtype=object)]], dtype=object)]
 [array(['foo3'], dtype=object)
  array([[array(['bar1'], dtype=object)],
         [array(['bar3'], dtype=object)]], dtype=object)]
 [array(['foo4'], dtype=object)
  array([[array(['bar1'], dtype=object)]], dtype=object)]]

Question is: Is there a way to make a clean solution with nice readable output like

FOOs    BARs
foo1    bar1
        bar2
        bar3
foo2    bar1
        bar2
foo3    bar1
        bar3
foo4    bar1

Thanks in advance.

CodePudding user response：

How about:

df_foo.merge(df_bar, on='IDs')[['FOOs', 'BARs']].drop_duplicates()

CodePudding user response：

merge and pivot may work, - not sure about speed.

df_out = pd.merge(df_foo, df_bar, on='IDs').pivot_table(index=['FOOs', 'BARs'])

out:

               IDs
FOOs BARs          
foo1 bar1  2.800000
     bar2  1.000000
     bar3  3.000000
foo2 bar1  2.600000
     bar2  3.500000
foo3 bar1  4.666667
     bar3  2.000000
foo4 bar1  3.000000