I have an R data frame like the example below (but with tens of thousands of rows).
A1 <- c("AB AC AD AE AF AG","AB AD AH AI AJ")
q1 <- c("AB AC AE","AD AJ AI")
id <- 1:2
df <- data.frame(id,A1,q1)
For each row, I want to remove from A1 every word that appears in q1 on that same row. The result should look like this:
df$A_clean <- c("AD AF AG", "AB AH")
I have tried using "str_split" and the "exclude" function from the qdap package, but this seems to work on the whole column at once rather than row by row. It just gives me the unique words in each row of A1, excluding every word that appears anywhere in the q1 column.
CodePudding user response:
Here is a base R method:
check_list <- list(
A1 = strsplit(df$A1, "\\s"),
q1 = strsplit(df$q1, "\\s")
)
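# vapply() returns a plain character vector; lapply() would create a list column
df$A_clean <- vapply(
  seq_len(nrow(df)),
  \(i) {
    A1 <- check_list[["A1"]][[i]]
    q1 <- check_list[["q1"]][[i]]
    # keep the words of A1 that do not occur in q1 on this row
    vals_to_keep <- A1[!A1 %in% q1]
    paste(vals_to_keep, collapse = " ")
  },
  character(1)
)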
df
#   id                A1       q1  A_clean
# 1  1 AB AC AD AE AF AG AB AC AE AD AF AG
# 2  2    AB AD AH AI AJ AD AJ AI    AB AH
CodePudding user response:
In base R, with setdiff:
mapply(\(x, y) paste(setdiff(x, y), collapse = " "),
strsplit(df$A1, " "), strsplit(df$q1, " "))
#[1] "AD AF AG" "AB AH"
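One thing to keep in mind: setdiff() treats its inputs as sets, so a word that is repeated in A1 comes back only once. Here is a minimal sketch of the difference from the %in% approach, using a made-up input with a repeated word (the vectors a and q are hypothetical and not part of the question's data):

```r
# Hypothetical input: "AD" appears twice in the text to be cleaned
a <- strsplit("AB AD AD AF", " ")[[1]]
q <- strsplit("AB", " ")[[1]]

# setdiff() works on sets, so the repeated "AD" is collapsed
dedup <- paste(setdiff(a, q), collapse = " ")

# logical indexing with %in% keeps every occurrence
kept <- paste(a[!a %in% q], collapse = " ")

dedup  # "AD AF"
kept   # "AD AD AF"
```

If repeated words should survive, the %in% form is probably the safer choice; if they should be dropped, setdiff() does it for free.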
CodePudding user response:
Using just gsub() (not removing duplicates in A1):
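# Turn each q1 into a regex alternation, e.g. "AB ?| ?AC ?| ?AE", and delete the matches.
# USE.NAMES = FALSE stops mapply() from naming the result after the A1 strings.
df$A_clean <- mapply(\(x, y) gsub(y, "", x), df$A1, gsub(" ", " ?| ?", df$q1), USE.NAMES = FALSE)
df$A_clean
# [1] "AD AF AG" "AB AH"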
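A pattern built this way has no word boundaries, so a short word in q1 could also match inside a longer word in A1. Below is a rough sketch of a stricter variant, assuming the words consist only of letters, digits or underscores and contain no regex metacharacters. The names patterns and cleaned are made up for this illustration.

```r
A1 <- c("AB AC AD AE AF AG", "AB AD AH AI AJ")
q1 <- c("AB AC AE", "AD AJ AI")

# One pattern per row, e.g. "\\b(AB|AC|AE)\\b", so only whole words match
patterns <- paste0("\\b(", gsub(" +", "|", q1), ")\\b")

cleaned <- mapply(
  function(x, p) {
    out <- gsub(p, "", x, perl = TRUE)  # drop whole-word matches
    trimws(gsub(" +", " ", out))        # collapse and trim leftover spaces
  },
  A1, patterns, USE.NAMES = FALSE
)

cleaned
# [1] "AD AF AG" "AB AH"
```

Removing the spaces in a separate step keeps the pattern simple, at the cost of a second gsub() call per row.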
Data
A1 <- c("AB AC AD AE AF AG","AB AD AH AI AJ")
q1 <- c("AB AC AE","AD AJ AI")
id <- 1:2
df <- data.frame(id,A1,q1)