I am supposed to aggregate a table of sales by year and id, then also add an age column to the table. Unfortunately when I add the last column, I keep getting NaN values for that column only. Like so:
|year|id|sales|age|
__________________
2022 |1 | 200| NaN|
|2 | 342| NaN|
2021 |34| 500| NaN|
|10| 20 | NaN|
|7 | 4200| Nan|
I have an "age" column in my df, so each id has an age assigned to it. Here is my code:
df['sales'] = df.price*df.quantity
df['year'] = pd.DatetimeIndex(df['date']).year
def table(df):
test = order.groupby(['year','id'])\
.agg(sales = ('sales', 'sum'))\
.assign(age = df['age'])
return table
table(df)
Could someone please let me know why it does not show the age the way it is supposed to?
CodePudding user response:
It gives NaN
values as it does not understand what you want. You are assigning a column to a smaller dataframe as you are grouping by year
and id
. So the question is, what do you want to happen to the column age
?
With my dummy dataset:
df = pd.DataFrame({'year': [2021, 2021, 2021, 2022],
'id': [1, 1, 2, 1],
'sales': [20, 30, 40, 50],
'age': [3, 4, 5, 6]})
Average age
If you want to sum or take the average of the column age, do a similar thing as you did with sales
:
df.groupby(['year', 'id']).agg(sales = ('sales', 'sum'), age = ('age', 'mean'))
Output:
sales age
year id
2021 1 50 3.5
2 40 5.0
2022 1 50 6.0
Give all unique values of age
Do you want to get a list of all values of age, use the argument unique
instead of mean
:
df.groupby(['year', 'id']).agg(sales = ('sales', 'sum'), age = ('age', 'unique'))
Output:
sales age
year id
2021 1 50 [3, 4]
2 40 [5]
2022 1 50 [6]
Single value for age
If the age is the same within a year/id group, you can take the minimum (or maximum, it does not matter as they are the same). Note that in my dummy dataset they are not the same:
df.groupby(['year', 'id']).agg(sales = ('sales', 'sum'), age = ('age', 'min'))
Output:
sales age
year id
2021 1 50 3
2 40 5
2022 1 50 6