Home > Blockchain >  Filter out all rows of a df below a certain value of a particular column
Filter out all rows of a df below a certain value of a particular column

Time:09-22

I have a df which has two columns, time and val. The df is arranged by time. I want to filter out all the rows from the max value, in this case it is 1.29. I have provided example below:

library(tidyverse)
library(lubridate)


# This is the entire df
df1 <- tibble::tribble(
  ~date, ~val,
  "2021-09-16 11:02:45", 1.21,
  "2021-09-16 11:02:45", 1.21,
  "2021-09-16 11:02:45", 1.21,
  "2021-09-16 11:02:45", 1.22,
  "2021-09-16 11:02:45", 1.22,
  "2021-09-16 11:02:45", 1.22,
  "2021-09-16 11:02:37", 1.22,
  "2021-09-16 10:59:29", 1.29,
  "2021-09-16 10:59:14", 1.29,
  "2021-09-16 10:59:14", 1.28,
  "2021-09-16 10:59:14", 1.28,
  "2021-09-16 10:58:17", 1.28,
  "2021-09-16 10:58:17", 1.28,
  "2021-09-16 10:58:05", 1.26,
  "2021-09-16 10:58:05", 1.26,
  "2021-09-16 10:58:05", 1.23,
  "2021-09-16 10:57:16", 1.23
  
  ) %>%
  mutate(date = ymd_hms(date))


# This is the outcome I am looking for
tibble::tribble(
  ~date, ~val,
  "2021-09-16 10:59:29", 1.29,
  "2021-09-16 10:59:14", 1.29,
  "2021-09-16 10:59:14", 1.28,
  "2021-09-16 10:59:14", 1.28,
  "2021-09-16 10:58:17", 1.28,
  "2021-09-16 10:58:17", 1.28,
  "2021-09-16 10:58:05", 1.26,
  "2021-09-16 10:58:05", 1.26,
  "2021-09-16 10:58:05", 1.23,
  "2021-09-16 10:57:16", 1.23
  
) %>%
  mutate(date = ymd_hms(date))

How to do this efficiently, any ideas?

CodePudding user response:

If I understood correctly this might solve your problem

library(dplyr)

df1 %>% 
  filter(date <= first(date[val == max(val)]))

# A tibble: 10 x 2
   date                  val
   <dttm>              <dbl>
 1 2021-09-16 10:59:29  1.29
 2 2021-09-16 10:59:14  1.29
 3 2021-09-16 10:59:14  1.28
 4 2021-09-16 10:59:14  1.28
 5 2021-09-16 10:58:17  1.28
 6 2021-09-16 10:58:17  1.28
 7 2021-09-16 10:58:05  1.26
 8 2021-09-16 10:58:05  1.26
 9 2021-09-16 10:58:05  1.23
10 2021-09-16 10:57:16  1.23

CodePudding user response:

Here are couple of other dplyr options using match.

  1. Using slice -
library(dplyr)
df1 %>% slice(match(max(val), val):n())

#   date                  val
#   <dttm>              <dbl>
# 1 2021-09-16 10:59:29  1.29
# 2 2021-09-16 10:59:14  1.29
# 3 2021-09-16 10:59:14  1.28
# 4 2021-09-16 10:59:14  1.28
# 5 2021-09-16 10:58:17  1.28
# 6 2021-09-16 10:58:17  1.28
# 7 2021-09-16 10:58:05  1.26
# 8 2021-09-16 10:58:05  1.26
# 9 2021-09-16 10:58:05  1.23
#10 2021-09-16 10:57:16  1.23
  1. Using filter
df1 %>% filter(row_number() >= match(max(val), val))

You may also use base R to do the same.

df1[match(max(df1$val), df1$val):nrow(df1), ]         

CodePudding user response:

We can use

library(dplyr)
df1 %>% 
    filter(row_number() >=which.max(val))

-output

# A tibble: 10 x 2
   date                  val
   <dttm>              <dbl>
 1 2021-09-16 10:59:29  1.29
 2 2021-09-16 10:59:14  1.29
 3 2021-09-16 10:59:14  1.28
 4 2021-09-16 10:59:14  1.28
 5 2021-09-16 10:58:17  1.28
 6 2021-09-16 10:58:17  1.28
 7 2021-09-16 10:58:05  1.26
 8 2021-09-16 10:58:05  1.26
 9 2021-09-16 10:58:05  1.23
10 2021-09-16 10:57:16  1.23

CodePudding user response:

df1 %>%
  filter(cumsum(val == max(val)) >= 1)

Here, we keep rows for which the cumulative times we've reached the max value is at least one.

I assume here that the data is sorted by date already.

  • Related