Home > Software design >  Pandas: How to keep the categorical column after the groupby.agg() operation with respect to index?
Pandas: How to keep the categorical column after the groupby.agg() operation with respect to index?

Time:11-30

I am fairly new to pandas and the answer might be obvious. Is there a way to keep the categorical column after the groupby.agg() operation with respect to index? I have a dataset that looks like this,

        Temp (C) Dew Point Temp (C) Rel Hum (%) Wind Spd (km/h) Weather   St
0            8.1                4.5          78               4   Clear  0.0
1            8.1                4.5          78               4   Clear  2.0
2            8.1                4.5          78               4   Clear  3.0
3            8.1                4.5          78               4   Clear  3.0
4            8.1                4.5          78               4   Clear  2.0
...          ...                ...         ...             ...     ...  ...
 
[1051200 rows x 6 columns]

I used this command on this dataset, to convert the minutely data to hourly data,

N = 60
newdf = newdf.groupby( newdf.index // N).agg({'Temp (C)':'mean','Dew Point Temp (C)':'mean','Rel Hum (%)':'mean','Wind Spd (km/h)':'mean','St':'sum'})

This serves my purpose just fine. I am getting the mean of 'Temp (C)', 'Dew Point Temp (C)', 'Rel Hum (%)', 'Wind Spd (km/h)' and getting the sum of 'St'. But the problem is, the 'Weather' column is getting disappeared.

                 Temp (C)   Dew Point Temp (C)  ... Wind Spd (km/h)     St
0       8.100000000000009                  4.5  ...             4.0  160.0
1       6.300000000000007                  5.1  ...            10.0  161.0
2                     7.0    6.100000000000005  ...             7.0  162.0
3       8.199999999999992    6.199999999999993  ...            14.0  161.0
4        8.60000000000001    6.100000000000005  ...            21.0  163.0
...                   ...                  ...  ...             ...    ...
 

[17520 rows x 5 columns]

I have found this thread on similar topic. But it doesnt seem to help me. Because, i am grouping the rows with respect to index. What i am doing is taking the minutely data, and converting it to hourly data.

What I want is to keep the 'Weather' column by taking the first row value of each 60 rows.

CodePudding user response:

You can add your own lambda function lambda x: x.iloc[0] to your agg dict:

newdf = newdf.groupby( newdf.index // N, ).agg({'Temp (C)':'mean','Dew Point Temp (C)':'mean','Rel Hum (%)':'mean','Wind Spd (km/h)':'mean', 'Weather': lambda x: x.iloc[0], 'St':'sum'})

CodePudding user response:

Create a new entry in agg dict: 'Weather': 'first'

What I want is to keep the 'Weather' column by taking the first row value of each 60 rows.

N = 5

newdf = newdf.groupby(newdf.index // N) \
             .agg({'Temp (C)': 'mean',
                   'Dew Point Temp (C)': 'mean',
                   'Rel Hum (%)': 'mean',
                   'Wind Spd (km/h)': 'mean',
                   'Weather': 'first',
                   'St': 'sum'})

Output:

>>> newdf
   Temp (C)  Dew Point Temp (C)  Rel Hum (%)  Wind Spd (km/h) Weather    St
0       8.1                 4.5         78.0              4.0   Clear  10.0
  • Related