Fill in lookup table in Python using pandas dataframe and aggregate function


I have a pandas dataframe containing (shop, product, price) information for different days like this one.

df = pd.DataFrame(data={'day': [1, 2, 3, 1, 2, 3, 2, 4, 5, 2, 4, 5, 2, 4, 5, 2, 4, 5],
                        'shop': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'c'],
                        'product': ['x', 'x', 'x', 'y', 'y', 'y', 'x', 'x', 'x', 'z', 'z', 'z', 'y', 'y', 'y', 'z', 'z', 'z'],
                        'price': [0, 1, 2, 2, 4, 6, 1, 2, 3, 0, 1, 1, 1, 1, 0, 2, 2, 2]})

I want to make a lookup table that gives me, for each (shop, product) combination, the mean price over the two most recent available days.

For example, given shop 'a' and product 'y', the prices are 2, 4 and 6, so the result in the lookup table should be the average of 4 and 6, that is to say 5.

The expected result is the following nested dictionary:

{'a': {'x': 1.5, 'y': 5.0},
 'b': {'x': 2.5, 'z': 1.0},
 'c': {'y': 0.5, 'z': 2.0}}

I came up with two solutions.

#Solution 1: nested for loops

lookup = {}
for categ_1 in df['shop'].unique():
    df_1 = df[df['shop'] == categ_1]
    lookup[categ_1] = {}
    for categ_2 in df_1['product'].unique():
        df_2 = df_1[df_1['product'] == categ_2]
        res = df_2.iloc[-2:,:]['price'].mean()
        lookup[categ_1][categ_2] = res

#Solution 2: filter unique combinations of (shop, product) and iterate over them

lookup = {}
for i, row in df[['shop', 'product']].drop_duplicates().iterrows():
    mask = ((df['shop'] == row['shop']) & (df['product'] == row['product']))
    _df = df[mask]
    res = _df.iloc[-2:,:]['price'].mean()
    try:
        lookup[row['shop']].update({row['product']: res})
    except KeyError:
        lookup[row['shop']] = {row['product']: res}

Though I find solution 2 more elegant, solution 1 is faster:

  • solution 1: 7.57 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • solution 2: 9.3 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

My actual dataframe contains many more columns and many more rows to iterate over, so I would like, if possible, to avoid the nested for loops of solution 1 but also find something faster than solution 2.

I would be glad if you could improve one of these solutions, or find a better one.

Thank you

CodePudding user response:

Pandas has a groupby function that is ideal for this.

lookup = df.groupby(['shop', 'product'])

This gives you a GroupBy object. The challenge is then how to aggregate each group over the two most recent days, since that isn't built into GroupBy. You can write a lambda that grabs the last two prices in each group and averages them.

# Sort by day so that the last rows of each group are the most recent days
df.sort_values(by='day', ascending=True, inplace=True)
# Average the last two prices in each group (assumes at least two rows per group)
avg_func = lambda x: sum(x[-2:]) / 2
lookup = df.groupby(['shop', 'product']).agg({'price': avg_func})

Output:

              price
shop product       
a    x          1.5
     y          5.0
b    x          2.5
     z          1.0
c    y          0.5
     z          2.0
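Since the question asks for a nested dictionary, one more step unpacks the grouped result. A minimal sketch (the intermediate name `means` is just for illustration), using `tail(2)` so that a group with a single row would still average correctly:

```python
import pandas as pd

df = pd.DataFrame(data={'day': [1, 2, 3, 1, 2, 3, 2, 4, 5, 2, 4, 5, 2, 4, 5, 2, 4, 5],
                        'shop': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'c'],
                        'product': ['x', 'x', 'x', 'y', 'y', 'y', 'x', 'x', 'x', 'z', 'z', 'z', 'y', 'y', 'y', 'z', 'z', 'z'],
                        'price': [0, 1, 2, 2, 4, 6, 1, 2, 3, 0, 1, 1, 1, 1, 0, 2, 2, 2]})

# Sort so the last rows of each group are the most recent days;
# kind='stable' preserves the original row order on ties.
df = df.sort_values('day', kind='stable')

# tail(2) takes at most the two most recent rows per group, so the
# mean is still well defined when a group has fewer than two rows.
means = df.groupby(['shop', 'product'])['price'].apply(lambda s: s.tail(2).mean())

# Unpack the MultiIndex Series into the nested dict the question asks for.
lookup = {}
for (shop, product), price in means.items():
    lookup.setdefault(shop, {})[product] = price

print(lookup)
# {'a': {'x': 1.5, 'y': 5.0}, 'b': {'x': 2.5, 'z': 1.0}, 'c': {'y': 0.5, 'z': 2.0}}
```

If the `apply` with a Python lambda turns out to be a bottleneck on a large frame, a fully vectorized variant is `df.sort_values('day').groupby(['shop', 'product']).tail(2).groupby(['shop', 'product'])['price'].mean()`, which first keeps the two most recent rows per group and then aggregates without any per-group Python call.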