Average vectors between two pandas DataFrames

Assume there are two DataFrames, defined as follows:

import pandas as pd
import numpy as np 

df1 = pd.DataFrame({'item':['apple', 'orange', 'melon',
                            'meat', 'milk', 'soda', 'wine'],
                    'vector':[[12, 31, 45], [21, 14, 56], 
                              [9, 47, 3], [20, 7, 98], 
                              [11, 67, 5], [23, 45, 3],
                              [8, 9, 33]]})

df2 = pd.DataFrame({'customer':[1,2,3],
                    'grocery':[['apple', 'soda', 'wine'],
                               ['meat', 'orange'],
                               ['coffee', 'meat', 'milk', 'orange']]})

The outputs of df1 and df2 are

df1
    item    vector
0   apple   [12, 31, 45]
1   orange  [21, 14, 56]
2   melon   [9, 47, 3]
3   meat    [20, 7, 98]
4   milk    [11, 67, 5]
5   soda    [23, 45, 3]
6   wine    [8, 9, 33]

df2
    customer    grocery
0   1           [apple, soda, wine]
1   2           [meat, orange]
2   3           [coffee, meat, milk, orange]

The goal is to average the vectors of the items in each customer's grocery list. If an item is not listed in df1, use [0, 0, 0] to represent it, so 'coffee' = [0, 0, 0]. The final df2 should look like this:

    customer    grocery                  average
0   1   [apple, soda, wine]             [14.33, 28.33, 27]
1   2   [meat, orange]                  [20.5, 10.5, 77]
2   3   [coffee, meat, milk, orange]    [13, 22, 39.75]

Here, customer 1's average is taken over the vectors of apple, soda, and wine, and customer 3's over the vectors of coffee, meat, milk, and orange. Again, coffee = [0, 0, 0] because it is not in df1. Any suggestions? Many thanks in advance.

CodePudding user response：

This answer may be long-winded and not optimized, but it will serve your purpose.

First, check whether each item in df2 is present in df1, and add any missing item to df1 with a zero vector.

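import itertools

# Items already present in df1, collected once for fast membership checks.
known = set(df1['item'])

# Walk over every distinct item that appears in any grocery list.
for i in set(itertools.chain.from_iterable(df2['grocery'])):
    if i not in known:
        # Append the unknown item with a zero vector (this modifies df1 in place).
        df1.loc[len(df1.index)] = [i, [0, 0, 0]]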
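
The loop above appends rows to df1 in place. If you would rather leave df1 untouched, here is a minimal sketch of the same padding step done with a set difference and pd.concat. The names wanted, missing, padding and df1_padded are made up for this illustration, and the tiny frames are stand-ins for the ones in the question. The rest of this answer keeps using df1.

```python
import itertools

import pandas as pd

# Small stand-ins for the frames in the question.
df1 = pd.DataFrame({'item': ['apple', 'meat'],
                    'vector': [[12, 31, 45], [20, 7, 98]]})
df2 = pd.DataFrame({'customer': [1],
                    'grocery': [['coffee', 'meat']]})

# Every distinct item that appears in any grocery list.
wanted = set(itertools.chain.from_iterable(df2['grocery']))

# Items that have no row in df1 (sorted only to make the order predictable).
missing = sorted(wanted - set(df1['item']))

# One zero-vector row per missing item, appended to a copy of df1.
padding = pd.DataFrame({'item': missing,
                        'vector': [[0, 0, 0] for _ in missing]})
df1_padded = pd.concat([df1, padding], ignore_index=True)

print(df1_padded)
```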

Next, use a list comprehension to compute the element-wise average of each customer's vectors and store it in a new column of df2.

df2['average'] = [np.mean(list(df1.loc[df1['item'].isin(i)]["vector"]),axis=0) for i in df2["grocery"]]

df2
Out[91]: 
   customer  ...                                         average
0         1  ...  [14.333333333333334, 28.333333333333332, 27.0]
1         2  ...                              [20.5, 10.5, 77.0]
2         3  ...                             [13.0, 22.0, 39.75]

[3 rows x 3 columns]
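
As a possible alternative, here is a sketch that skips the padding step entirely by looking items up in a plain dict and falling back to a zero vector for unknown items. The names lookup, zero and average_vector are hypothetical, and rounding to two decimals is my own choice to match the table in the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'item': ['apple', 'orange', 'melon',
                             'meat', 'milk', 'soda', 'wine'],
                    'vector': [[12, 31, 45], [21, 14, 56],
                               [9, 47, 3], [20, 7, 98],
                               [11, 67, 5], [23, 45, 3],
                               [8, 9, 33]]})

df2 = pd.DataFrame({'customer': [1, 2, 3],
                    'grocery': [['apple', 'soda', 'wine'],
                                ['meat', 'orange'],
                                ['coffee', 'meat', 'milk', 'orange']]})

# item -> vector mapping; assumes item names in df1 are unique.
lookup = dict(zip(df1['item'], df1['vector']))
zero = [0, 0, 0]  # stand-in for items that are not in df1


def average_vector(items):
    """Element-wise mean of the vectors of `items`, rounded to 2 decimals."""
    vectors = [lookup.get(item, zero) for item in items]
    return np.mean(vectors, axis=0).round(2).tolist()


df2['average'] = df2['grocery'].apply(average_vector)
print(df2['average'].tolist())
# → [[14.33, 28.33, 27.0], [20.5, 10.5, 77.0], [13.0, 22.0, 39.75]]
```

Because this walks the grocery list itself, an item that appears twice in a list is counted twice, whereas the isin filter above matches each row of df1 at most once. It also leaves df1 unchanged.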

CodePudding user response：

Can you check if this works? I'll add an explanation if it works.

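# Expand the vectors into numeric columns first: np.mean on each list would collapse it to a single scalar.
vectors = pd.DataFrame(df1['vector'].tolist(), index=df1['item'])
d2 = df2.explode('grocery')
df2['average'] = vectors.reindex(d2['grocery']).fillna(0).set_axis(d2.index).groupby(level=0).mean().round(2).values.tolist()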
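
For readers unfamiliar with the explode / groupby(level=0) pattern used above, here is a small sketch of what it does on its own. The toy frame and the name regrouped are made up for this illustration.

```python
import pandas as pd

df2 = pd.DataFrame({'customer': [1, 2],
                    'grocery': [['apple', 'soda'], ['meat']]})

# explode() gives each list element its own row and repeats the original
# index label, so the rows of customer 1 both keep index 0.
d2 = df2.explode('grocery')
print(d2.index.tolist())  # → [0, 0, 1]

# groupby(level=0) groups rows by that repeated index label, which puts the
# pieces of each original row back together.
regrouped = d2['grocery'].groupby(level=0).agg(list)
print(regrouped.tolist())  # → [['apple', 'soda'], ['meat']]
```

Any per-item lookup can be done on the exploded rows in between, and the result of the groupby lines up with the index of df2 again, which is why it can be assigned straight back as a new column.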