Imagine a data frame...
df <- rbind("A*YOU 1.000 0.780", "A*YOUR 1.000 0.780", "B*USE 0.800 0.678", "B*USER 0.700 1.000")
df <- as.data.frame(df)
df
... which prints...
> df
V1
1 A*YOU 1.000 0.780
2 A*YOUR 1.000 0.780
3 B*USE 0.800 0.678
4 B*USER 0.700 1.000
... from which I would like to remove every row whose first field is not exactly one of the elements of a vector (called tenables here):
tenables <- c("A*YOU", "B*USE")
so that the outcome becomes:
> df
V1
1 A*YOU 1.000 0.780
2 B*USE 0.800 0.678
Any ideas on how to solve this? Many thanks in advance.
CodePudding user response:
> df[gsub("\\s*\\d+\\.*", "", df$V1) %in% tenables, , drop = FALSE]
V1
1 A*YOU 1.000 0.780
3 B*USE 0.800 0.678
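The one-liner above strips the numeric columns and compares what is left against tenables. The same idea can be sketched in two steps so the intermediate keys are visible. The name keys and the simpler pattern "\\s.*$" (drop everything from the first whitespace onward) are assumptions for illustration, not part of the answer.

```r
df <- data.frame(V1 = c("A*YOU 1.000 0.780", "A*YOUR 1.000 0.780",
                        "B*USE 0.800 0.678", "B*USER 0.700 1.000"),
                 stringsAsFactors = FALSE)
tenables <- c("A*YOU", "B*USE")

# Step 1: keep only the text before the first whitespace character
keys <- sub("\\s.*$", "", df$V1)
print(keys)

# Step 2: exact (not partial) comparison against tenables
result <- df[keys %in% tenables, , drop = FALSE]
print(result)
```

Because %in% tests for equality, A*YOUR is not confused with A*YOU, and the * never gets interpreted as a regex quantifier.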
CodePudding user response:
Your tenables contain regex specials (* means "0 or more of the previous character/class/group"), so they cannot be used as patterns as-is. Because we also want \\b (word-boundary) to differentiate between YOU and YOUR, we cannot simply use fixed=TRUE in the grepl call either. As such, we need to find those specials and backslash-escape them, and then wrap the result in \\b. A word boundary is used here because requiring a space or any other specific character after the key may be over-constraining.
## clean up tenables to be regex-friendly and precise
gsub("([].*+(){}[])", "\\\\\\1", tenables)
# [1] "A\\*YOU" "B\\*USE"
## combine into a single pattern for simple use in grepl
paste0("\\b(", paste(gsub("([].*+(){}[])", "\\\\\\1", tenables), collapse = "|"), ")\\b")
# [1] "\\b(A\\*YOU|B\\*USE)\\b"
## subset your frame, keeping only the rows that match
subset(df, grepl(paste0("\\b(", paste(gsub("([].*+(){}[])", "\\\\\\1", tenables), collapse = "|"), ")\\b"), V1))
# V1
# 1 A*YOU 1.000 0.780
# 3 B*USE 0.800 0.678
Regex explanation:
\\b(A\\*YOU|B\\*USE)\\b
^^^                 ^^^  "word boundary": the position between a word character
                         (A-Z, a-z, 0-9, or _) and a non-word character or the
                         begin/end of string
   ^               ^     parens "group" the alternatives so that both word
                         boundaries apply to each of them
    ^^^^^^^              literal "A", "*", "Y", "O", "U" (same with other string)
           ^             the "|" means "OR", so either the "A*" or the "B*" strings
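To see what the word boundary contributes, the sketch below runs the escaped key with and without \\b on two of the rows. The vector x and the names without_b and with_b are introduced here only for illustration.

```r
x <- c("A*YOU 1.000 0.780", "A*YOUR 1.000 0.780")

# Without a boundary, "A*YOU" is also found at the start of "A*YOUR"
without_b <- grepl("A\\*YOU", x)

# With \\b, the character after "U" must not be a word character
with_b <- grepl("\\bA\\*YOU\\b", x)

print(without_b)
print(with_b)
```

Note that \\b only helps when a key begins and ends with a word character, as both keys here do. A key ending in a symbol such as * would likely need a different delimiter.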
CodePudding user response:
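One approach uses sapply on the strsplit result of the column of df, looking only at the first space-separated entry of each row (e.g. A*YOU for A*YOU 1.000 0.780) and matching it literally with fixed = TRUE, so that * is not treated as a regex quantifier.
df[sapply(strsplit(df$V1, " "), function(x)
  any(grepl(x[1], tenables, fixed = TRUE))), , drop = FALSE]
V1
1 A*YOU 1.000 0.780
3 B*USE 0.800 0.678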
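One caveat with this approach: grepl looks for the first field anywhere inside each element of tenables, so it is a substring test rather than an equality test. A minimal sketch of the difference follows, assuming a hypothetical list tenables2 that contains the longer key A*YOUR. The names first_field, by_grepl and by_in are also made up for this example.

```r
df <- data.frame(V1 = c("A*YOU 1.000 0.780", "A*YOUR 1.000 0.780",
                        "B*USE 0.800 0.678", "B*USER 0.700 1.000"),
                 stringsAsFactors = FALSE)

# Hypothetical list that contains the longer key "A*YOUR"
tenables2 <- c("A*YOUR", "B*USE")

# First space-separated field of every row
first_field <- vapply(strsplit(df$V1, " ", fixed = TRUE),
                      function(x) x[1], character(1))

# Substring test: "A*YOU" is found inside "A*YOUR", so row 1 is kept too
by_grepl <- unname(vapply(first_field,
                          function(k) any(grepl(k, tenables2, fixed = TRUE)),
                          logical(1)))

# Equality test: only rows whose first field equals an element exactly
by_in <- first_field %in% tenables2

print(df[by_grepl, , drop = FALSE])
print(df[by_in, , drop = FALSE])
```

With the original tenables from the question both tests give the same rows. The difference only shows up when one key is a prefix or substring of another.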