NaN when adding column to the aggregated table-CodePudding

I am supposed to aggregate a table of sales by year and id, then also add an age column to the table. Unfortunately when I add the last column, I keep getting NaN values for that column only. Like so:

|year|id|sales|age|
__________________
2022 |1 | 200| NaN|
     |2 | 342| NaN|
2021 |34| 500| NaN|
     |10| 20 | NaN|
     |7 | 4200| Nan|

I have an "age" column in my df, so each id has an age assigned to it. Here is my code:

df['sales'] = df.price*df.quantity
df['year'] = pd.DatetimeIndex(df['date']).year 

def table(df):
    test = order.groupby(['year','id'])\
    .agg(sales = ('sales', 'sum'))\
    .assign(age = df['age'])
    return table

table(df)

Could someone please let me know why it does not show the age the way it is supposed to?

CodePudding user response：

It gives NaN values as it does not understand what you want. You are assigning a column to a smaller dataframe as you are grouping by year and id. So the question is, what do you want to happen to the column age?

With my dummy dataset:

df = pd.DataFrame({'year': [2021, 2021, 2021, 2022],
                   'id': [1, 1, 2, 1],
                   'sales': [20, 30, 40, 50],
                   'age': [3, 4, 5, 6]})

Average age

If you want to sum or take the average of the column age, do a similar thing as you did with sales:

df.groupby(['year', 'id']).agg(sales = ('sales', 'sum'), age = ('age', 'mean'))

Output:

         sales  age
year id            
2021 1      50  3.5
     2      40  5.0
2022 1      50  6.0

Give all unique values of age

Do you want to get a list of all values of age, use the argument unique instead of mean:

df.groupby(['year', 'id']).agg(sales = ('sales', 'sum'), age = ('age', 'unique'))

Output:

         sales     age
year id               
2021 1      50  [3, 4]
     2      40     [5]
2022 1      50     [6]

Single value for age

If the age is the same within a year/id group, you can take the minimum (or maximum, it does not matter as they are the same). Note that in my dummy dataset they are not the same:

df.groupby(['year', 'id']).agg(sales = ('sales', 'sum'), age = ('age', 'min'))

Output:

         sales  age
year id            
2021 1      50    3
     2      40    5
2022 1      50    6