I'm pulling data that has a device ID, a timestamp, and a value. I'll have lots of different device ID's and data points for each hour of the day across an entire month. If I have two pieces of data with the same ID and the same Time, but different Values, how can I remove the lower of the two (which would just be bad data)? Alternatively, I can do this step further below. I'm not sure which would be easier.
Here's an example of the data I'd be working with.
ID Time Value
1234 64058 2022-03-23 08:00:00 1014.98
1235 64059 2022-03-23 08:00:00 1001.34
1236 64060 2022-03-23 08:00:00 1001.52
1237 64061 2022-03-23 08:00:00 1483.11
1238 64062 2022-03-23 08:00:00 1342.23
1239 64063 2022-03-23 08:00:00 1201.34
1242 64058 2022-03-23 08:00:00 10.55
To make the data easier to see, I merge it with another dataframe that has actual device names rather than just ID's and then pivot it to a wider format with this.
av_device_merge <- merge(av_data, device_list, by.av_data = ID, by.device_list = ID) %>% select(-c(ID)
pivoted <- av_device_merge %>% pivot_wider(names_from = Device_Short_Name, values_from = Value)
Which yields this result (just including the one hour as an example):
Time Device 1 Device 2 Device 3 Device 4 Device 5 Device 6
2022-03-23 08:00:00 c(1014.98, 10.55) 1001.34 1001.52 1483.11 1342.23 1201.34
2022-03-23 09:00:00
...
This is the part where I could also look at each column, and if the data present is a vector, then keep the highest value only and change to a numeric value.
Which of the two would be easier to do? I'm guessing the first option? Any ideas on how to accomplish this goal?
Thanks in advance!
CodePudding user response:
Multiple ways to do this
data %>%
group_by(ID, Time) %>%
slice_max(n = 1, order_by = Value, with_ties = FALSE)
data %>%
group_by(ID, Time)%>%
top_n(1, wt = Value)
I prefer slice_max and the with_ties = FALSE in case of identical values.