I'm struggling on how can I keep on my dataframe only the elements that the difference of values in measure
are minimal 2
for each group in id1
. Here is a toy example:
test = data.frame(measure = c(1, 2, 3, 2, 2, 4, 0, 1),
id1 = c("A", "B", "A", "B", "C", "C", "A", "D"),
id2 = c("16", "16", "16", "16", "17", "17", "17", "18"))
# > test
# measure id1 id2
# 1 1 A 16
# 2 2 B 16
# 3 3 A 16
# 4 2 B 16
# 5 2 C 17
# 6 4 C 17
# 7 0 A 17
# 8 1 D 18
#as result, I'd like something like
#> res
# measure id1 id2
# 1 1 A 16
# 3 3 A 16
# 5 2 C 17
# 6 4 C 17
# 7 0 A 17
The id1
equal A
and D
where removed, since their max(value) - min(value) < 2
.
Any hint on how can I apply this?
CodePudding user response:
Another tidyverse
option.
test %>%
group_by(id1) %>%
filter((max(measure)- min(measure)) >= 2)
CodePudding user response:
A possible solution:
library(tidyverse)
test = data.frame(measure = c(1, 2, 3, 2, 2, 4, 0, 1),
id1 = c("A", "B", "A", "B", "C", "C", "A", "D"),
id2 = c("16", "16", "16", "16", "17", "17", "17", "18"))
test %>%
group_by(id1) %>%
mutate(aux = (max(c_across(measure))- min(c_across(measure))) >= 2) %>%
ungroup %>% filter(aux) %>% select(-aux)
#> # A tibble: 5 × 3
#> measure id1 id2
#> <dbl> <chr> <chr>
#> 1 1 A 16
#> 2 3 A 16
#> 3 2 C 17
#> 4 4 C 17
#> 5 0 A 17
CodePudding user response:
The data.table structure provides an option here.
# Load library
library(data.table)
# Create data.table
test <-
data.table(measure = c(1, 2, 3, 2, 2, 4, 0, 1),
id1 = c("A", "B", "A", "B", "C", "C", "A", "D"),
id2 = c("16", "16", "16", "16", "17", "17", "17", "18"))
# Solution:
test[id1 %in% test[, .(max(measure)-min(measure)), by=id1][V1>=2, id1]]
In this solution, test[, .(max(measure)-min(measure)), by=id1]
is giving a data.table of differences and [V1>=2, id1]
is subsetting that and returning a vector of valid id1 values. The id1 %in%
then subsets the original data on the contents of that vector.
The result is:
measure id1 id2
1: 1 A 16
2: 3 A 16
3: 2 C 17
4: 4 C 17
5: 0 A 17
A mini data.table lesson:
With data.tables, we use the notation
DT[i, j, by]
where:DT
is the data.tablei
is the operations on observations (e.g. filter, sort)j
is the operations on variables (e.g. select, process)by
(andkeyby
) provides grouping for thej
-operations
data.tables have some special characters including
.N
for "number of observations".SD
for "selected variables"
Install and load the data.table package to use this structure.
Convert an existing data.frame, list or tibble into a data.table using
setDT()
ordata.table()
Use
fread()
to read external data files directly into a data.table.
An example - To get the number of cars and mean fuel efficiency of cars from mtcars, where hp>100 and summary stats grouped by number of cylinders:
dtcars <- data.table(mtcars)
dtcars[hp>100, .("n" = .N, "mean_mpg" = mean(mpg)), keyby=cyl]