I am fairly new to pandas and the answer might be obvious.
Is there a way to keep the categorical column after the groupby.agg()
operation with respect to index?
I have a dataset that looks like this,
Temp (C) Dew Point Temp (C) Rel Hum (%) Wind Spd (km/h) Weather St
0 8.1 4.5 78 4 Clear 0.0
1 8.1 4.5 78 4 Clear 2.0
2 8.1 4.5 78 4 Clear 3.0
3 8.1 4.5 78 4 Clear 3.0
4 8.1 4.5 78 4 Clear 2.0
... ... ... ... ... ... ...
[1051200 rows x 6 columns]
I used this command on this dataset, to convert the minutely data to hourly data,
N = 60
newdf = newdf.groupby( newdf.index // N).agg({'Temp (C)':'mean','Dew Point Temp (C)':'mean','Rel Hum (%)':'mean','Wind Spd (km/h)':'mean','St':'sum'})
This serves my purpose just fine. I am getting the mean of 'Temp (C)', 'Dew Point Temp (C)', 'Rel Hum (%)', 'Wind Spd (km/h)' and getting the sum of 'St'. But the problem is, the 'Weather' column is getting disappeared.
Temp (C) Dew Point Temp (C) ... Wind Spd (km/h) St
0 8.100000000000009 4.5 ... 4.0 160.0
1 6.300000000000007 5.1 ... 10.0 161.0
2 7.0 6.100000000000005 ... 7.0 162.0
3 8.199999999999992 6.199999999999993 ... 14.0 161.0
4 8.60000000000001 6.100000000000005 ... 21.0 163.0
... ... ... ... ... ...
[17520 rows x 5 columns]
I have found this thread on similar topic. But it doesnt seem to help me. Because, i am grouping the rows with respect to index. What i am doing is taking the minutely data, and converting it to hourly data.
What I want is to keep the 'Weather' column by taking the first row value of each 60 rows.
CodePudding user response:
You can add your own lambda function lambda x: x.iloc[0]
to your agg
dict:
newdf = newdf.groupby( newdf.index // N, ).agg({'Temp (C)':'mean','Dew Point Temp (C)':'mean','Rel Hum (%)':'mean','Wind Spd (km/h)':'mean', 'Weather': lambda x: x.iloc[0], 'St':'sum'})
CodePudding user response:
Create a new entry in agg
dict: 'Weather': 'first'
What I want is to keep the 'Weather' column by taking the first row value of each 60 rows.
N = 5
newdf = newdf.groupby(newdf.index // N) \
.agg({'Temp (C)': 'mean',
'Dew Point Temp (C)': 'mean',
'Rel Hum (%)': 'mean',
'Wind Spd (km/h)': 'mean',
'Weather': 'first',
'St': 'sum'})
Output:
>>> newdf
Temp (C) Dew Point Temp (C) Rel Hum (%) Wind Spd (km/h) Weather St
0 8.1 4.5 78.0 4.0 Clear 10.0