Pandas - Add mean, max, min as row in dataframe

The dataframe eventually gets exported to Excel. I am trying to create an additional row with the average and max above each column.

Do not want to disturb the original headers for the actual data.

I don't want to hard-code column names, since they will change; I need something more generic. I attempted to create a max row but failed. I need the max above the column headers.

CodePudding user response：

Try this. I don't know how to place the rows above the dataframe, but in the end this might still be a good solution:

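import pandas as pd

# Sample data, passed straight to the constructor instead of reusing the name `df` for a dict
df = pd.DataFrame({
    'date and time': ['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04'],
    '<PowerAC--->': [40, 20, 9, 12],
})

# Columns to summarise
cols = ['<PowerAC--->']
# Use string names for both functions; recent pandas versions warn when given the builtin `max`
agg = df[cols].agg(['mean', 'max'])

# Append the 'mean' and 'max' rows below the data
out = pd.concat([df, agg])

print(out)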
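
To avoid hard-coding column names, as the question asks, one option is to let pandas pick the numeric columns itself. This is only a minimal sketch, not a definitive answer: it reuses the sample frame from above, and the names `numeric_cols` and `stats` are just illustrative. It also lists the statistics first in `pd.concat` so they land above the data rows.

```python
import pandas as pd

df = pd.DataFrame({
    'date and time': ['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04'],
    '<PowerAC--->': [40, 20, 9, 12],
})

# Pick up every numeric column automatically instead of naming them
numeric_cols = df.select_dtypes(include='number').columns

# One row per statistic, one column per numeric column
stats = df[numeric_cols].agg(['mean', 'max'])

# Statistics first, data second; then restore the original column order
out = pd.concat([stats, df])[df.columns]

print(out)
```

The trailing `[df.columns]` is there because `stats` lacks the date column, so the concatenation can otherwise reorder the columns.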

CodePudding user response：

A one-liner that also removes the "NaN" values to make the output look cleaner (I'm a bit OCD ;))

# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
pd.concat([df, df.agg({'<PowerAC--->': ['mean', 'max']})]).fillna('')

CodePudding user response：

I would say it's a good idea to keep your data separated from the reporting on it - I don't really see the logic for an "additional row above the column".

I would generate statistics for the overall data as a separate dataframe.

import pandas as pd
import numpy as np

np.random.seed(1)
t = pd.date_range(start='2022-05-31', end='2022-06-07')
x = np.random.rand(len(t))

df = pd.DataFrame({'date': t, 'data': x})
print(df)

# numeric_only=False became the default in pandas 2.0; it is spelled out here
# so that the datetime column is included on older versions too
d_mean = df.mean(numeric_only=False)
d_max = df.max()
# We need to transpose this, as the `d_mean` and `d_max` are Series (columns), and we want them as rows
df_stats = pd.DataFrame({'mean': d_mean, 'max': d_max}).transpose()
print(df_stats)

df output:

        date      data
0 2022-05-31  0.417022
1 2022-06-01  0.720324
2 2022-06-02  0.000114
3 2022-06-03  0.302333
4 2022-06-04  0.146756
5 2022-06-05  0.092339
6 2022-06-06  0.186260
7 2022-06-07  0.345561

df_stats output:

                    date      data
mean 2022-06-03 12:00:00  0.276339
max  2022-06-07 00:00:00  0.720324

You could stack the statistics on top of the dataframe with

pd.concat([df_stats, df])

which looks like

                    date      data
mean 2022-06-03 12:00:00  0.276339
max  2022-06-07 00:00:00  0.720324
0    2022-05-31 00:00:00  0.417022
1    2022-06-01 00:00:00  0.720324
2    2022-06-02 00:00:00  0.000114
3    2022-06-03 00:00:00  0.302333
4    2022-06-04 00:00:00  0.146756
5    2022-06-05 00:00:00  0.092339
6    2022-06-06 00:00:00   0.18626
7    2022-06-07 00:00:00  0.345561

but I would keep them separate unless you've got a very good reason to.

There may be some way which makes sense using a multi-index, but that's a bit beyond me, and probably beyond the scope of this question.
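
For what it's worth, here is a rough sketch of what a multi-index version might look like. It is an assumption on my part rather than something tested against the Excel export in the question: the statistics are formatted as strings and placed in extra header levels above the column names, leaving the data rows untouched. The column names `power` and `temp` and the variable `report` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with two numeric columns
df = pd.DataFrame({
    'power': [40, 20, 9, 12],
    'temp': [1.5, 2.5, 3.5, 4.5],
})

# One row per statistic, one column per data column
stats = df.agg(['mean', 'max'])

# Three header levels: mean on top, then max, then the original column name
header = pd.MultiIndex.from_tuples(
    [
        (f"mean={stats.loc['mean', col]:.2f}",
         f"max={stats.loc['max', col]:.2f}",
         col)
        for col in df.columns
    ],
    names=['mean', 'max', 'column'],
)

# Work on a copy so the original frame keeps its plain headers
report = df.copy()
report.columns = header

print(report)
```

The statistics become text here, so this is purely a presentation layer; `df` and `stats` remain the objects to compute with.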

Edit: If the max and mean of the date column carry no meaning for you, but you still want something compatible with that column (i.e. still a datetime, but effectively null), you could replace them with np.datetime64('NaT') (NaT is similar to NaN, but means "not a time"):

df_stats['date'] = np.datetime64('NaT')
print(pd.concat([df_stats, df]).head())

output:

           date      data
mean        NaT  0.276339
max         NaT  0.720324
0    2022-05-31  0.417022
1    2022-06-01  0.720324
2    2022-06-02  0.000114