I am building a function that creates an aggregate table with certain variables developed from df. I think I got most things right, but I am struggling understanding how to count a number of distinct products bought by the customer in previous year.
Example of my df:
df = pd.DataFrame({'year': [2020, 2021, 2021, 2021, 2022],
'id': [3, 1, 1, 2, 1],
'price': [12, 20, 30, 40, 50],
'age': [20, 30, 30, 44, 31],
'product': [book, toy, book, tv, book],
'quantity': [2,1,2,5,9})
here is my code so far:
df['revenue'] = df.price*df.quantity
df['year'] = pd.DatetimeIndex(df['date']).year
def table(df):
return df.groupby(['year','id'])\
.agg(revenue = ('revenue', 'sum'), age = ('age', 'unique'), product_year_before = ('product', 'nunique'))
It counts products correctly, but it needs to be for previous year and not for current year.
CodePudding user response:
For previous year
you can subtract one year in MultiIndex
:
df1 = table(df)
s = df1['product_year_before'].rename(lambda x: x - 1, level=0)
print (s)
year id
2019 3 1
2020 1 2
2 1
2021 1 1
Name: product_year_before, dtype: int64
All together - if need new column use DataFrame.join
with s
Series:
def table(df):
df1 = (df.groupby(['year','id'])
.agg(revenue = ('revenue', 'sum'),
age = ('age', 'unique'),
product_year_before = ('product', 'nunique')))
s = df1['product_year_before'].rename(lambda x: x - 1, level=0)
return df1.drop('product_year_before', axis=1).join(s)
df1 = table(df)
print (df1)
revenue age product_year_before
year id
2020 3 24 [20] NaN
2021 1 80 [30] 1.0
2 200 [44] NaN
2022 1 450 [31] NaN
With original column:
def table(df):
df1 = (df.groupby(['year','id'])
.agg(revenue = ('revenue', 'sum'),
age = ('age', 'unique'),
product_year= ('product', 'nunique')))
s = df1['product_year'].rename(lambda x: x - 1, level=0)
return df1.join(s.rename('product_year_before'))
df1 = table(df)
print (df1)
revenue age product_year product_year_before
year id
2020 3 24 [20] 1 NaN
2021 1 80 [30] 2 1.0
2 200 [44] 1 NaN
2022 1 450 [31] 1 NaN