Keep only rows if number is greater than... in specific column

Time:12-16

This is an example of data:

exp_data <- structure(list(
  Seq = c("AAAARVDS", "AAAARVDSSSAL", "AAAARVDSRASDQ"),
  Change = structure(c(19L, 20L, 13L), .Label = c(
    "", "C[ 58]", "C[ 58], F[ 1152]", "C[ 58], F[ 1152], L[ 12], M[ 12]",
    "C[ 58], L[ 2909]", "L[ 12]", "L[ 370]", "L[ 504]", "M[ 12]",
    "M[ 1283]", "M[ 1457]", "M[ 1491]", "M[ 16]", "M[ 16], Y[ 1013]",
    "M[ 16], Y[ 1152]", "M[ 16], Y[ 762]", "M[ 371]", "M[ 386], Y[ 12]",
    "M[ 486], W[ 12]", "Y[ 12]", "Y[ 1240]", "Y[ 1502]", "Y[ 1988]",
    "Y[ 2918]"), class = "factor"),
  Mass = c(1869.943, 1048.459, 707.346),
  Size = structure(c(2L, 2L, 2L), .Label = c("Matt", "Greg", "Kieran"),
                   class = "factor"),
  Number = c(2L, 2L, 2L)),
  row.names = c(244L, 392L, 396L), class = "data.frame")

Note the Change column, which is the one I would like to filter on. There are three rows here, and I would like to keep only the first one, because it contains a change greater than 100 for one of the letters. In general, I would like to keep every row in which at least one letter has a change greater than 100. The Change column may list up to 4-5 letters, but if at least one of them has a modification greater than 100, the row should be kept.

Is there a simple way to do that?

Expected output:

              Seq          Change     Mass Size Number
244      AAAARVDS M[ 486], W[ 12] 1869.943 Greg      2

CodePudding user response:

I'm not entirely sure I understood your problem statement correctly, but perhaps something like this:

library(dplyr)
library(stringr)
exp_data %>% filter(str_detect(Change, "\\d{3}"))
#       Seq          Change     Mass Size Number
#1 AAAARVDS M[ 486], W[ 12] 1869.943 Greg      2 

Or the same in base R

exp_data[grep("\\d{3}", exp_data$Change), ]
#         Seq          Change     Mass Size Number
#244 AAAARVDS M[ 486], W[ 12] 1869.943 Greg      2

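The idea is to use a regular expression to keep only those rows where Change contains at least one run of three digits, i.e. a number of 100 or more (assuming no leading zeros). Note that this also keeps a change of exactly 100.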
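If the threshold has to be strictly greater than 100, or might change later, one option is to extract the numbers and compare them numerically instead of counting digits. Below is a minimal base R sketch of that idea; `has_change_above` is a hypothetical helper name introduced here for illustration, not part of the answer above.

```r
# Hypothetical helper (an assumption, not from the original answer):
# returns TRUE for each element of x that contains at least one number
# strictly greater than `threshold`.
has_change_above <- function(x, threshold = 100) {
  # Change is a factor in exp_data, so convert to character first
  x <- as.character(x)
  # Pull out every run of digits; the result is a list with one
  # character vector per element of x
  digits <- regmatches(x, gregexpr("[0-9]+", x))
  # An element with no numbers gives any(logical(0)), which is FALSE
  vapply(digits, function(d) any(as.numeric(d) > threshold), logical(1))
}

# Small demonstration on strings in the same format as the Change column
demo <- c("M[ 486], W[ 12]", "Y[ 12]", "M[ 100]", "")
print(has_change_above(demo))
# TRUE FALSE FALSE FALSE
```

With the example data this could be used as `exp_data[has_change_above(exp_data$Change), ]`. Unlike the three-digit pattern, a change of exactly 100 is not kept here.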

CodePudding user response:

You can use str_extract_all from the stringr package

library(stringr)

data.table solution

library(data.table)
setDT(exp_data)

exp_data[, max := max(as.numeric(str_extract_all(Change, "[[:digit:]]+")[[1]])), by = Seq]
exp_data[max > 100, ]

        Seq          Change   Mass Size Number max
1: AAAARVDS M[ 486], W[ 12] 1869.9 Greg      2 486

dplyr solution

library(dplyr)

exp_data %>% 
  group_by(Seq) %>% 
  filter(max(as.numeric(str_extract_all(Change, "[[:digit:]]+")[[1]])) > 100)

# A tibble: 1 x 5
# Groups:   Seq [1]
  Seq      Change           Mass Size  Number
  <chr>    <fct>           <dbl> <fct>  <int>
1 AAAARVDS M[ 486], W[ 12] 1870. Greg       2