Writing a function or loop to replace data based on two conditions, one of which is time

I have a very large dataset (1,017,211 rows) that contains pressure data collected at 1-second intervals. Random spikes in pressure happen and I need to remove them. The data (df) looks like this: a column for pressure, a column for date and time, and an index I created in place of seconds. I have included about 50 rows from the data that would all need to be set to zero.

dput(df[9549:9600,]) 

structure(list(Pressure = c(0, 10272, 10270, 10270, 10307, 
10332, 10336, 10336, 10335, 10335, 10335, 10336, 10333, 10333, 
10332, 10330, 10330, 10329, 10328, 10328, 10328, 10328, 10327, 
10327, 10326, 10326, 10325, 10325, 10324, 10324, 10323, 10323, 
10323, 10322, 10322, 10321, 10321, 10321, 10321, 10320, 10320, 
10320, 10319, 10318, 10318, 10317, 10318, 10388, 10581, 10787, 
11007,0 ), DateTime = structure(c(1611279415, 1611279416, 1611279417, 
1611279418, 1611279419, 1611279420, 1611279421, 1611279422, 1611279423, 
1611279424, 1611279425, 1611279426, 1611279427, 1611279428, 1611279429, 
1611279430, 1611279431, 1611279432, 1611279433, 1611279434, 1611279435, 
1611279436, 1611279437, 1611279438, 1611279439, 1611279440, 1611279441, 
1611279442, 1611279443, 1611279444, 1611279445, 1611279446, 1611279447, 
1611279448, 1611279449, 1611279450, 1611279451, 1611279452, 1611279453, 
1611279454, 1611279455, 1611279456, 1611279457, 1611279458, 1611279459, 
1611279460, 1611279461, 1611279462, 1611279463, 1611279464, 1611279465
), class = c("POSIXct", "POSIXt"), tzone = ""), Index = 9549:9600), row.names = 9549:9600, class = "data.frame")

My goal is to code a function/loop/script that will set pressure readings to 0 (or NA or anything less than 10,000) if the amount of time that the pressure is over 10,000 is less than 2 minutes. As an example, if pressure hovers around 11,000 for 1 minute, those 60 pressure data points would be set to 0.

The 10,000 mark is set because I am not too worried about anything below that. What is most challenging is coding the time aspect correctly and I am not even sure if R can do this easily.

I tried to make a for loop looking only at pressure readings above 10,000 and attempting to sum the seconds to see if time spent above 10,000 was less than 2 mins (120 seconds):

for (i in df$Pressure) {
  if(df$Pressure > 10000 & sum(nrow(df$Index)) < 120) {
    df$Pressure = 0
  } else(df$Pressure <- df$Pressure)
}

This resulted in errors saying: "the condition has length > 1 and only the first element will be used".

I also tried using setDT() as others have suggested (as in here - Sum values from one df base on two columns of another df):

pressure.df <- setDT(df)[Pressure > 10000 & sum(nrow(Index)) < 120, Pressure := 0]

This resulted in a dataframe that sent every value over 10,000 to 0.

Any help would be appreciated and it is not necessary to use the index to count the time if using the DateTime column would be better.

Answer:

With data.table and dplyr:

We can use data.table::rleid to give each run of consecutive rows on the same side of the threshold its own group id, and n() to count the rows in each group. replace() then sets Pressure to 0 wherever the row is above the threshold (Pressure > 10000) and its group has fewer than 60 rows. At one row per second, 60 rows is one minute; use 120 for the two-minute cutoff in the question. This only works if there is strictly one observation for every second. If there are missing rows or duplicate DateTime values, it may yield inconsistent results.
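To make the grouping step concrete, here is a small illustration of what rleid returns for a threshold test. The vector p is made up for the example and is not taken from the question's data; ave() is used only to show the run length attached to each row.

```r
library(data.table)

# Toy pressure vector (made up for illustration): two runs above 10000
p <- c(9000, 10500, 10600, 9500, 9400, 10700)

above  <- p > 10000        # FALSE TRUE TRUE FALSE FALSE TRUE
run_id <- rleid(above)     # 1 2 2 3 3 4: the id increases each time `above` flips

# Number of rows in the run each row belongs to: 1 2 2 2 2 1
run_len <- ave(seq_along(p), run_id, FUN = length)

data.frame(p, above, run_id, run_len)
```

Rows that share a run_id form one uninterrupted stretch above (or below) the threshold, which is why a per-group row count can stand in for duration when there is exactly one row per second.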

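library(data.table)
library(dplyr)

# Group consecutive rows by whether Pressure is above 10000, then zero any
# above-threshold run shorter than 60 rows (60 s at one row per second;
# use 120 for a two-minute cutoff).
df %>%
  group_by(rleid = data.table::rleid(Pressure > 10000)) %>%
  mutate(Pressure = replace(Pressure, n() < 60 & Pressure > 10000, 0)) %>%
  ungroup()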
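If the one-row-per-second assumption does not hold, one option is to measure each run's duration from DateTime instead of counting rows. The following is a minimal base-R sketch, not a drop-in replacement for the code above. zero_short_spikes is a hypothetical helper name, the 120-second default comes from the question, and the toy data is made up. It assumes rows are sorted by DateTime, Pressure has no NA values, and each reading stands for one second.

```r
# Hypothetical helper: zero runs above `threshold` whose elapsed time,
# measured from DateTime, is shorter than `min_secs`.
zero_short_spikes <- function(df, threshold = 10000, min_secs = 120) {
  above  <- df$Pressure > threshold
  # Run id increases each time the above/below state flips
  # (a base-R equivalent of data.table::rleid).
  run_id <- cumsum(c(TRUE, diff(above) != 0))
  secs   <- as.numeric(df$DateTime)
  # Elapsed seconds per run: last minus first timestamp, plus one
  # 1-second sample (assumption: each reading covers one second).
  dur <- ave(secs, run_id, FUN = function(s) max(s) - min(s) + 1)
  df$Pressure[above & dur < min_secs] <- 0
  df
}

# Toy data (made up): a 3-second spike, then a 120-second plateau
t0  <- as.POSIXct("2021-01-22 00:00:00", tz = "UTC")
toy <- data.frame(
  Pressure = c(0, 11000, 11000, 11000, 0, rep(10500, 120), 0),
  DateTime = t0 + 0:125
)
toy <- toy[-50, ]   # simulate one missing reading inside the plateau

out <- zero_short_spikes(toy)
```

In this toy case the plateau has only 119 rows after the dropped reading, so a row count with a cutoff of 120 would zero it, while the timestamp-based duration is still 120 seconds and the plateau is kept. The 3-second spike is set to 0 either way.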