Home > Enterprise >  Create new dataframe that contain the average value from some of the columns in the old dataframe
Create new dataframe that contain the average value from some of the columns in the old dataframe

Time:03-11

I have a dataframe extracted from a csv file. I want to iterate a data process where only some of the columns's data is the mean of n rows, while the rest of the columns is the first row for each iteration.

For example, the data extracted from the csv consisted of 100 rows and 6 columns. I have a variable n_AVE = 6, which tells the code to average the data per 6 rows.

rawDf = pd.read_csv(outputFilePath / 'Raw_data.csv', encoding='CP932')
OUT: 
       TIME      A       B        C        D        E
0     2021/3/4   148      0       142       0        1      [0]
1     2021/3/5   148      0       142       0        1
2     2021/3/6   150      0       148       0        1
3     2021/3/7   150      0       148       0        1
4     2021/3/8   151      0       148       0        1
5     2021/3/9   151      0       148       0        1
....
91    2021/4/30  195      5       180       0        1      [5]
92    2021/5/1   195      5       180       0        1
93    2021/5/2   195      5       180       0        1
94    2021/5/3   200      5       180       0        1
95    2021/5/4   200      0       200       0        1
96    2021/5/5   200      5       200       0        1      [6]
97    2021/5/6   200      5       200       1        1
98    2021/5/7   200      5       200       1        1
99    2021/5/8   205      5       210       1        1
100   2021/5/9   205      5       210       1        1

Take only the first row of [TIME, D, E] columns
Average the data per n_AVE (6) from [A, B, C] columns.
I want to create a new dataframe which looks like this

OUT: 
       TIME         A       B         C        D        E
0     2021/3/4    149.66    0        146       0        1
....
5     2021/4/30   197.5   4.166     186.66     0        1
6     2021/5/5    168.33    5        170       0        1

The code is like this:

for x in range(0,len(rawDf.index), n_AVE): 
    df = pd.DataFrame([rawDf.iloc[[x],0], rawDf.iloc[x:(x   n_AVE),1:3].mean(), rawDf.iloc[x,4:5]])

But the code is not working because apparently when I use pandas.mean(), the dataframe's format changed into like this

df2 = rawDf.iloc[0:6,1:3].mean()
print(df2)

OUT: 
        index      0
    0     A      149.66    
    1     B       0.0
    2     C      146.0      
    [3 rows x 2 columns]

How to use pandas.mean() without losing the old format?
Or should I not use pandas.mean() and just create my own averaging code?

CodePudding user response:

You can group the dataframe by the grouper np.arange(len(df)) // 6 which groups the dataframe every six rows, then aggregate the columns using the desired aggregation functions to get the result, optionally reindex along axis=1 to reorder the columns

d = {
    'A': 'mean', 'B': 'mean', 'C': 'mean', 
    'TIME': 'first', 'D': 'first', 'E': 'first'
}

df.groupby(np.arange(len(df)) // 6).agg(d).reindex(df.columns, axis=1)

Define aggegation functions using columns index:

d = {
    **dict.fromkeys(df.columns[[0, 4, 5]], 'first'),
    **dict.fromkeys(df.columns[[1, 2, 3]], 'mean' )
}

df.groupby(np.arange(len(df)) // 6).agg(d).reindex(df.columns, axis=1)

Result

        TIME           A         B           C  D  E
0   2021/3/4  149.666667  0.000000  146.000000  0  1
1  2021/4/30  197.500000  4.166667  186.666667  0  1
2   2021/5/6  202.500000  5.000000  205.000000  1  1
  • Related