How to filter data frame by a minimal difference by groups-CodePudding

I'm struggling on how can I keep on my dataframe only the elements that the difference of values in measure are minimal 2 for each group in id1. Here is a toy example:

test = data.frame(measure = c(1, 2, 3, 2, 2, 4, 0, 1),
                  id1 = c("A", "B", "A", "B", "C", "C", "A", "D"),
                  id2 = c("16", "16", "16", "16", "17", "17", "17", "18"))
# > test
#   measure id1 id2
# 1       1   A  16
# 2       2   B  16
# 3       3   A  16
# 4       2   B  16
# 5       2   C  17
# 6       4   C  17
# 7       0   A  17
# 8       1   D  18

#as result, I'd like something like
#> res
#   measure id1 id2
# 1       1   A  16
# 3       3   A  16
# 5       2   C  17
# 6       4   C  17
# 7       0   A  17

The id1 equal A and D where removed, since their max(value) - min(value) < 2.

Any hint on how can I apply this?

CodePudding user response：

Another tidyverse option.

test %>% 
  group_by(id1) %>% 
  filter((max(measure)- min(measure)) >= 2)

CodePudding user response：

A possible solution:

library(tidyverse)

test = data.frame(measure = c(1, 2, 3, 2, 2, 4, 0, 1),
                  id1 = c("A", "B", "A", "B", "C", "C", "A", "D"),
                  id2 = c("16", "16", "16", "16", "17", "17", "17", "18"))

test %>% 
  group_by(id1) %>% 
  mutate(aux = (max(c_across(measure))- min(c_across(measure))) >= 2) %>% 
  ungroup %>% filter(aux) %>% select(-aux)

#> # A tibble: 5 × 3
#>   measure id1   id2  
#>     <dbl> <chr> <chr>
#> 1       1 A     16   
#> 2       3 A     16   
#> 3       2 C     17   
#> 4       4 C     17   
#> 5       0 A     17

CodePudding user response：

The data.table structure provides an option here.

# Load library
library(data.table)
# Create data.table
test <- 
  data.table(measure = c(1, 2, 3, 2, 2, 4, 0, 1),
             id1 = c("A", "B", "A", "B", "C", "C", "A", "D"),
             id2 = c("16", "16", "16", "16", "17", "17", "17", "18"))
# Solution:
test[id1 %in% test[, .(max(measure)-min(measure)), by=id1][V1>=2, id1]]

In this solution, test[, .(max(measure)-min(measure)), by=id1] is giving a data.table of differences and [V1>=2, id1] is subsetting that and returning a vector of valid id1 values. The id1 %in% then subsets the original data on the contents of that vector.

The result is:

 measure id1 id2
1:       1   A  16
2:       3   A  16
3:       2   C  17
4:       4   C  17
5:       0   A  17

A mini data.table lesson:

With data.tables, we use the notation DT[i, j, by] where:
- DT is the data.table
- i is the operations on observations (e.g. filter, sort)
- j is the operations on variables (e.g. select, process)
- by (and keyby) provides grouping for the j-operations
data.tables have some special characters including
- .N for "number of observations"
- .SD for "selected variables"
Install and load the data.table package to use this structure.
Convert an existing data.frame, list or tibble into a data.table using setDT() or data.table()
Use fread() to read external data files directly into a data.table.

An example - To get the number of cars and mean fuel efficiency of cars from mtcars, where hp>100 and summary stats grouped by number of cylinders:

dtcars <- data.table(mtcars)
dtcars[hp>100, .("n" = .N, "mean_mpg" = mean(mpg)), keyby=cyl]