Assume there are two DataFrames:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'item': ['apple', 'orange', 'melon',
                             'meat', 'milk', 'soda', 'wine'],
                    'vector': [[12, 31, 45], [21, 14, 56],
                               [9, 47, 3], [20, 7, 98],
                               [11, 67, 5], [23, 45, 3],
                               [8, 9, 33]]})
df2 = pd.DataFrame({'customer': [1, 2, 3],
                    'grocery': [['apple', 'soda', 'wine'],
                                ['meat', 'orange'],
                                ['coffee', 'meat', 'milk', 'orange']]})
The outputs of df1 and df2 are
df1
     item        vector
0   apple  [12, 31, 45]
1  orange  [21, 14, 56]
2   melon    [9, 47, 3]
3    meat   [20, 7, 98]
4    milk   [11, 67, 5]
5    soda   [23, 45, 3]
6    wine    [8, 9, 33]
df2
   customer                       grocery
0         1           [apple, soda, wine]
1         2                [meat, orange]
2         3  [coffee, meat, milk, orange]
The goal is to average the vectors of the items in each customer's grocery list. If an item is not listed in df1, use [0, 0, 0] to represent it, so 'coffee' = [0, 0, 0]. The final DataFrame df2 should look like
   customer                       grocery             average
0         1           [apple, soda, wine]  [14.33, 28.33, 27]
1         2                [meat, orange]    [20.5, 10.5, 77]
2         3  [coffee, meat, milk, orange]     [13, 22, 39.75]
where customer 1 averages the vectors of apple, soda, and wine, and customer 3 averages the vectors of coffee, meat, milk, and orange. Again, coffee = [0, 0, 0] here because it is not in df1. Any suggestions? Many thanks in advance.
CodePudding user response:
This answer may be long-winded and not optimized, but it will serve your purpose.
First, check whether each item in df2 is present in df1, so that you can add any missing item to df1 with a vector of zeros.
import itertools
for i in set(itertools.chain.from_iterable(df2['grocery'])):
    if i not in list(df1['item']):
        df1.loc[len(df1.index)] = [i, [0, 0, 0]]
Next, use a list comprehension to compute the element-wise average of the vectors for each grocery list and store it in a new column of df2.
df2['average'] = [np.mean(list(df1.loc[df1['item'].isin(i)]["vector"]),axis=0) for i in df2["grocery"]]
df2
Out[91]:
customer ... average
0 1 ... [14.333333333333334, 28.333333333333332, 27.0]
1 2 ... [20.5, 10.5, 77.0]
2 3 ... [13.0, 22.0, 39.75]
[3 rows x 3 columns]
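If you would rather not append rows to df1, the same averages can probably be obtained with a plain dictionary lookup. The following is only a sketch under assumptions: the names lookup, zero and average_vector are made up for illustration, and the fallback is hard-coded to three zeros to match the vectors in the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'item': ['apple', 'orange', 'melon', 'meat', 'milk', 'soda', 'wine'],
                    'vector': [[12, 31, 45], [21, 14, 56], [9, 47, 3], [20, 7, 98],
                               [11, 67, 5], [23, 45, 3], [8, 9, 33]]})
df2 = pd.DataFrame({'customer': [1, 2, 3],
                    'grocery': [['apple', 'soda', 'wine'],
                                ['meat', 'orange'],
                                ['coffee', 'meat', 'milk', 'orange']]})

# Plain dict from item name to vector; df1 itself is left unchanged.
lookup = dict(zip(df1['item'], df1['vector']))
zero = [0, 0, 0]  # assumed fallback for items that are missing from df1

def average_vector(items):
    # One row per item (zeros for unknown items), then average column-wise.
    rows = [lookup.get(item, zero) for item in items]
    return np.mean(rows, axis=0).tolist()

df2['average'] = df2['grocery'].apply(average_vector)
print(df2)
```

One difference to be aware of: an item that appears twice in a grocery list is counted twice here, whereas the isin-based version above counts it once.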
CodePudding user response:
Can you check if this works? I'll add an explanation if it works.
d2 = df2.explode('grocery')
df2['average'] = d2['grocery'].map(df1.set_index('item')['vector'].map(np.mean)).fillna(0).round(1).groupby(level=0).agg(list)
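Note that .map(np.mean) in the line above reduces each item's vector to a single number, so the result appears to be a list of per-item scalar means rather than the element-wise average asked for. A possible variant of the same explode idea that keeps the three components separate is sketched below; the names vec, looked_up and avg are illustrative, and rounding to two decimals is an assumption taken from the expected output in the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'item': ['apple', 'orange', 'melon', 'meat', 'milk', 'soda', 'wine'],
                    'vector': [[12, 31, 45], [21, 14, 56], [9, 47, 3], [20, 7, 98],
                               [11, 67, 5], [23, 45, 3], [8, 9, 33]]})
df2 = pd.DataFrame({'customer': [1, 2, 3],
                    'grocery': [['apple', 'soda', 'wine'],
                                ['meat', 'orange'],
                                ['coffee', 'meat', 'milk', 'orange']]})

# One row per (customer, item); the original df2 index labels are repeated.
d2 = df2.explode('grocery')

# Expand df1's vectors into numeric columns, indexed by item name.
vec = pd.DataFrame(df1['vector'].tolist(), index=df1['item'])

# Look up every exploded item; items missing from df1 become NaN rows, then zeros.
looked_up = vec.reindex(d2['grocery']).fillna(0)
looked_up.index = d2.index

# Average column-wise within each original df2 row and pack the rows back into lists.
avg = looked_up.groupby(level=0).mean().round(2)
df2['average'] = pd.Series(avg.values.tolist(), index=avg.index)
print(df2)
```

This relies on the item names in df1 being unique, since reindex raises an error when the index being looked up contains duplicates.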