Home > front end >  NaN when adding column to the aggregated table
NaN when adding column to the aggregated table

Time:10-17

I am supposed to aggregate a table of sales by year and id, then also add an age column to the table. Unfortunately when I add the last column, I keep getting NaN values for that column only. Like so:

|year|id|sales|age|
__________________
2022 |1 | 200| NaN|
     |2 | 342| NaN|
2021 |34| 500| NaN|
     |10| 20 | NaN|
     |7 | 4200| Nan|

I have an "age" column in my df, so each id has an age assigned to it. Here is my code:

df['sales'] = df.price*df.quantity
df['year'] = pd.DatetimeIndex(df['date']).year 

def table(df):
    test = order.groupby(['year','id'])\
    .agg(sales = ('sales', 'sum'))\
    .assign(age = df['age'])
    return table

table(df)

Could someone please let me know why it does not show the age the way it is supposed to?

CodePudding user response:

It gives NaN values as it does not understand what you want. You are assigning a column to a smaller dataframe as you are grouping by year and id. So the question is, what do you want to happen to the column age?

With my dummy dataset:

df = pd.DataFrame({'year': [2021, 2021, 2021, 2022],
                   'id': [1, 1, 2, 1],
                   'sales': [20, 30, 40, 50],
                   'age': [3, 4, 5, 6]})

Average age

If you want to sum or take the average of the column age, do a similar thing as you did with sales:

df.groupby(['year', 'id']).agg(sales = ('sales', 'sum'), age = ('age', 'mean'))

Output:

         sales  age
year id            
2021 1      50  3.5
     2      40  5.0
2022 1      50  6.0

Give all unique values of age

Do you want to get a list of all values of age, use the argument unique instead of mean:

df.groupby(['year', 'id']).agg(sales = ('sales', 'sum'), age = ('age', 'unique'))

Output:

         sales     age
year id               
2021 1      50  [3, 4]
     2      40     [5]
2022 1      50     [6]

Single value for age

If the age is the same within a year/id group, you can take the minimum (or maximum, it does not matter as they are the same). Note that in my dummy dataset they are not the same:

df.groupby(['year', 'id']).agg(sales = ('sales', 'sum'), age = ('age', 'min'))

Output:

         sales  age
year id            
2021 1      50    3
     2      40    5
2022 1      50    6
  • Related