I am working with the R programming language.
I have the following dataset:
v <- c(1,2,3,4,5,6,7,8,9,10)
var_1 <- as.factor(sample(v, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
var_2 <- as.factor(sample(v, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
var_3 <- as.factor(sample(v, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
var_4 <- as.factor(sample(v, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
var_5 <- as.factor(sample(v, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
my_data = data.frame(var_1, var_2, var_3, var_4, var_5)
I also have another dataset of "conditions" that will be used for querying this data frame:
conditions = data.frame(cond_1 = c("1,3,4", "4,5,6"), cond_2 = c("5,6", "7,8,9"))
My Question: I tried to run the following command to select rows from "my_data" based on the first row of "conditions" - but this returns an empty result:
my_data[my_data$var_1 %in% unlist(conditions[1,1]) &
my_data$var_2 %in% unlist(conditions[1,2]), ]
[1] var_1 var_2 var_3 var_4 var_5
<0 rows> (or 0-length row.names)
I tried to look more into this by "inspecting" these conditions:
class(conditions[1,1])
[1] "character"
This makes me think that the "unlist()" command is not working because the conditions themselves are a "character" instead of a "list".
Is there an equivalent command that can be used here that plays the same role as the "unlist()" command so that the above statement can be run?
In general, I am trying to produce the same results as I would have gotten from this code - but keeping the format I was using above:
my_data[my_data$var_1 %in% c("1", "3", "4") &
my_data$var_2 %in% c("5", "6"), ]
Thanks!
Reference: Selecting Rows of Data Based on Multiple Conditions
CodePudding user response:
Up front, "1,3,4" != 1
. It seems you should look to split the strings using strsplit(., ",")
.
expected <- my_data[my_data$var_1 %in% c("1", "3", "4") & my_data$var_2 %in% c("5", "6"), ]
head(expected)
# var_1 var_2 var_3 var_4 var_5
# 18 3 6 2 2 9
# 129 3 5 3 2 8
# 133 4 5 6 5 8
# 186 1 6 6 10 10
# 204 4 6 4 2 6
# 207 1 5 3 2 9
out <- my_data[do.call(`&`,
Map(`%in%`,
lapply(my_data[,1:2], as.character),
lapply(conditions, function(z) strsplit(z, ",")[[1]]))),]
head(out)
# var_1 var_2 var_3 var_4 var_5
# 18 3 6 2 2 9
# 129 3 5 3 2 8
# 133 4 5 6 5 8
# 186 1 6 6 10 10
# 204 4 6 4 2 6
# 207 1 5 3 2 9
Edit: update for new conditions
: change do.call
to Reduce
:
conditions = data.frame(cond_1 = c("1,3,4", "4,5,6"), cond_2 = c("5,6", "7,8,9"), cond_3 = c("4,6", "9"))
out <- my_data[Reduce(`&`,
Map(`%in%`,
lapply(my_data[,1:3], as.character),
lapply(conditions, function(z) strsplit(z, ",")[[1]]))),]
head(out)
# var_1 var_2 var_3 var_4 var_5
# 133 4 5 6 5 8
# 186 1 6 6 10 10
# 204 4 6 4 2 6
# 232 1 5 6 5 8
# 332 3 6 6 5 10
# 338 1 5 6 3 6
CodePudding user response:
Too long for a comment, but since I answered your previous question, I noticed an important difference in how your conditions
data frames look like. Perhaps that is the reason for your confusion.
I assume, you want/need lists in the cells but in this version you have strings that need to be split as @r2evans showed you in a great answer.
To create the conditions
data frame that resembles more that one of your previous question you may use list2DF
(also see this related answer).
conditions1 <- list2DF(list(cond_1=list(c(1, 3, 4), c(4, 5, 6)),
cond_2=list(c(5, 6), c(7, 8, 9))))
conditions1
# cond_1 cond_2
# 1 1, 3, 4 5, 6
# 2 4, 5, 6 7, 8, 9
where:
str(conditions1)
# 'data.frame': 2 obs. of 2 variables:
# $ cond_1:List of 2
# ..$ : num 1 3 4
# ..$ : num 4 5 6
# $ cond_2:List of 2
# ..$ : num 5 6
# ..$ : num 7 8 9
Your conditions
looks similar
conditions
# cond_1 cond_2
# 1 1,3,4 5,6
# 2 4,5,6 7,8,9
but:
str(conditions)
# 'data.frame': 2 obs. of 2 variables:
# $ cond_1: chr "1,3,4" "4,5,6"
# $ cond_2: chr "5,6" "7,8,9"
CodePudding user response:
Use the data generated in the Note at the end. It uses set.seed
to make it reproducible and also uses a reduced number of rows as the point can be illustrated just as well with fewer.
Generate the condition string cond
which for this example is "var_1 in (1,3,4) and var_2 in (5,6)"
and then insert it into an SQL statement which we run.
library(sqldf)
s <- sprintf("%s in (%s)", names(my_data)[1:ncol(conditions)], conditions[1, ])
cond <- paste(s, collapse = " and ")
fn$sqldf("select * from my_data where $cond")
## var_1 var_2 var_3 var_4 var_5
## 1 1 6 8 6 1
## 2 4 6 10 8 7
# check
my_data[my_data$var_1 %in% c("1", "3", "4") & my_data$var_2 %in% c("5", "6"), ]
## var_1 var_2 var_3 var_4 var_5
## 11 1 6 8 6 1
## 17 4 6 10 8 7
Note
set.seed(123)
conditions <- data.frame(cond_1 = c("1,3,4", "4,5,6"),
cond_2 = c("5,6", "7,8,9"))
v <- c(1,2,3,4,5,6,7,8,9,10)
p <- c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)
nr <- 25
var_1 <- as.factor(sample(v, nr, replace = TRUE, prob = p))
var_2 <- as.factor(sample(v, nr, replace = TRUE, prob = p))
var_3 <- as.factor(sample(v, nr, replace = TRUE, prob = p))
var_4 <- as.factor(sample(v, nr, replace = TRUE, prob = p))
var_5 <- as.factor(sample(v, nr, replace = TRUE, prob = p))
my_data = data.frame(var_1, var_2, var_3, var_4, var_5)
conditions <- data.frame(cond_1 = c("1,3,4", "4,5,6"),
cond_2 = c("5,6", "7,8,9"))