I'd like to calculate the median of the data values that share the same timestamp, using Pandas.
Part of my dataframe looks like this:
timestamp | data |
---|---|
8/2/2021 16:11:34 | 30 |
8/2/2021 16:11:34 | 32 |
8/2/2021 16:11:34 | 50 |
10/12/2021 19:25:17 | 40 |
10/12/2021 19:25:17 | 50 |
10/12/2021 19:25:17 | 60 |
I'd like this df to look like this:
timestamp | data | median |
---|---|---|
8/2/2021 16:11:34 | 30 | 32 |
8/2/2021 16:11:34 | 32 | 32 |
8/2/2021 16:11:34 | 50 | 32 |
10/12/2021 19:25:17 | 40 | 50 |
10/12/2021 19:25:17 | 50 | 50 |
10/12/2021 19:25:17 | 60 | 50 |
I was thinking of creating a new column and using a for loop to collect the data for each timestamp into an array and then calculate the median, but I suspect that would be too slow (the dataset has around 30,000 records). I looked for a simpler solution but couldn't find one. Any suggestions would be very much appreciated.
CodePudding user response:
One option is to call `median` in `groupby.transform`. It returns a Series with the same index as `df`, filled with the median value of each group.
df['median'] = df.groupby('timestamp')['data'].transform('median')
Output:
timestamp data median
0 8/2/2021 16:11:34 30 32.0
1 8/2/2021 16:11:34 32 32.0
2 8/2/2021 16:11:34 50 32.0
3 10/12/2021 19:25:17 40 50.0
4 10/12/2021 19:25:17 50 50.0
5 10/12/2021 19:25:17 60 50.0
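For anyone who wants to reproduce this, here is a minimal self-contained sketch that rebuilds the sample frame from the question and applies the grouped median. It assumes the timestamps are stored as plain strings, as they appear in the question, and it selects the `data` column explicitly so that the result is a single Series even if the real dataframe has more columns.

```python
import pandas as pd

# Sample data copied from the question; timestamps are kept as strings (assumption).
df = pd.DataFrame({
    "timestamp": ["8/2/2021 16:11:34"] * 3 + ["10/12/2021 19:25:17"] * 3,
    "data": [30, 32, 50, 40, 50, 60],
})

# Group rows by timestamp, take the median of `data` within each group,
# and broadcast that value back to every row of the group.
df["median"] = df.groupby("timestamp")["data"].transform("median")

print(df)
```

Because the grouping and the median are handled inside pandas rather than in a Python for loop, this should stay fast at around 30,000 rows.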