Identify first occurrence in a dataframe in R and subsequently label every third row

I have a large dataframe that I would like to downsample from 120 Hz to 40 Hz. The data are not consistent enough, though, for me to simply take every third row of the whole dataframe.

I have something like the following dataframe:

df <- data.frame(RelativeTime = c(0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550), 
         Marker = c("start_trial", "", "", "", "", "", "", "", "", "", "", "", "start_trial", "", "", "", "", "", "", "", "", "", "", ""), 
         size=c(NA, -1, -1, 4, -1, -1 , 3.5, -1, -1, 4, -1, -1, NA, 4, -1, -1, 2, -1, -1, -1, -1, -1, 4.5, -1),
         trial=c("trial 1", "trial 1", "trial 1", "trial 1", "trial 1", "trial 1", "trial 1", "trial 1", "trial 1", "trial 1", "trial 1", "trial 1", "trial 2", "trial 2", "trial 2", "trial 2", "trial 2", "trial 2", "trial 2", "trial 2", "trial 2", "trial 2", "trial 2", "trial 2" ))

What I would like is to identify the first valid entry in size for each trial (the first value that is neither NA nor -1) and then, starting from that row, filter the data.frame so that only every third row is retained. As the example shows, the position of that first valid value relative to "start_trial" varies between trials, which is what is giving me trouble. I cannot simply filter out the -1 values, because size is sometimes -1 only because a data collection point failed, and I need to retain those rows for further processing. So, I would like to end up with the following dataframe:

df2 <- data.frame(RelativeTime = c(0, 150, 300, 450, 0, 50, 200, 350, 500), 
         Marker = c("start_trial", "", "", "", "start_trial", "", "", "", ""), 
         size=c(NA, 4, 3.5, 4, NA, 4, 2, -1, 4.5),
         trial=c("trial 1", "trial 1", "trial 1", "trial 1", "trial 2", "trial 2", "trial 2", "trial 2", "trial 2"))

Thanks a lot for helping out!

CodePudding user response:

Are you looking for this?

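library(dplyr)

df %>%
  group_by(trial) %>%
  # drop the rows before the first positive size; NA is treated like -1
  filter(cumsum(coalesce(size, -1) > 0) >= 1) %>%
  # from that row on, keep every third row
  filter(row_number() %% 3 == 1)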

# A tibble: 7 x 4
# Groups:   trial [2]
  RelativeTime Marker  size trial  
         <dbl> <chr>  <dbl> <chr>  
1          150 ""       4   trial 1
2          300 ""       3.5 trial 1
3          450 ""       4   trial 1
4           50 ""       4   trial 2
5          200 ""       2   trial 2
6          350 ""      -1   trial 2
7          500 ""       4.5 trial 2

If you also want to keep the first row of each trial, you can add those rows back with bind_rows().
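A minimal sketch of that bind_rows() idea follows. It rebuilds the question's data compactly with consistent trial labels, and it assumes that each trial's first row is the one with Marker == "start_trial" and that this row never has a valid size (otherwise it would appear twice). The names sampled, markers and result are only illustrative.

```r
library(dplyr)

# Same values as in the question, written compactly
df <- data.frame(
  RelativeTime = rep(seq(0, 550, by = 50), times = 2),
  Marker = rep(c("start_trial", rep("", 11)), times = 2),
  size = c(NA, -1, -1, 4, -1, -1, 3.5, -1, -1, 4, -1, -1,
           NA, 4, -1, -1, 2, -1, -1, -1, -1, -1, 4.5, -1),
  trial = rep(c("trial 1", "trial 2"), each = 12)
)

# Every third row, counted from the first positive size in each trial
sampled <- df %>%
  group_by(trial) %>%
  filter(cumsum(coalesce(size, -1) > 0) >= 1) %>%
  filter(row_number() %% 3 == 1) %>%
  ungroup()

# The marker rows that the first filter dropped
markers <- df %>% filter(Marker == "start_trial")

# Combine both and restore the time order within each trial
result <- bind_rows(markers, sampled) %>%
  arrange(trial, RelativeTime)
```

With this data, result has nine rows and should match the df2 given in the question.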

CodePudding user response:

Another solution:

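library(dplyr)

df %>% 
  group_by(trial) %>% 
  # start becomes TRUE from the first positive size onward (NA or FALSE before it)
  mutate(start = cumany(size > 0)) %>% 
  group_by(trial, start) %>% 
  # keep the marker row, plus every third row once start is TRUE
  filter(Marker == "start_trial" | (start %in% TRUE & row_number() %% 3 == 1))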

  RelativeTime Marker         size trial   start
         <dbl> <chr>         <dbl> <chr>   <lgl>
1            0 "start_trial"  NA   trial 1 NA   
2          150 ""              4   trial 1 TRUE 
3          300 ""              3.5 trial 1 TRUE 
4          450 ""              4   trial 1 TRUE 
5            0 "start_trial"  NA   trial 2 NA   
6           50 ""              4   trial 2 TRUE 
7          200 ""              2   trial 2 TRUE 
8          350 ""             -1   trial 2 TRUE 
9          500 ""              4.5 trial 2 TRUE
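
To see what the flag-and-count logic in both answers does, here is a base R sketch on a single trial, without dplyr. It uses the size and RelativeTime values of "trial 2" from the question; the variable names started, pos and keep are only illustrative, and it assumes the first row of the trial is the marker row.

```r
# size and time values of "trial 2" from the question
size <- c(NA, 4, -1, -1, 2, -1, -1, -1, -1, -1, 4.5, -1)
time <- seq(0, 550, by = 50)

# TRUE from the first valid (positive, non-NA) sample onward
started <- cumsum(!is.na(size) & size > 0) >= 1

# Position of each row counted from that first valid sample (0 before it)
pos <- cumsum(started)

# Keep positions 1, 4, 7, ... plus the very first row (the trial marker)
keep <- (started & pos %% 3 == 1) | seq_along(size) == 1

time[keep]  # 0 50 200 350 500
size[keep]  # NA 4 2 -1 4.5
```

The failed sample at 350 (size -1) is retained because rows are selected by position, not by value.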