Given an R dataframe with two columns with strings of words in each, remove words repeating in betwe-CodePudding

I have an R data frame like the example below (but with tens of thousands of rows).

A1 <- c("AB AC AD AE AF AG","AB AD AH AI AJ")
q1 <- c("AB AC AE","AD AJ AI")
id <- 1:2
df <- data.frame(id,A1,q1)

I would like the result to look like this:

df$A_clean <- c("AD AF AG", "AB AH")

I have tried using "str_split" and the "exclude" function from the qdap package, but these seem to work on the whole column at once rather than row by row. The result is just the unique words in each row of A1, excluding every word that appears anywhere in the q1 column.

CodePudding user response：

Here is a base R method:


check_list  <- list(
    A1 = strsplit(df$A1, "\\s"),
    q1 = strsplit(df$q1, "\\s")
)

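# Build A_clean row by row; vapply() returns a plain character vector,
# whereas lapply() would store a list column in the data frame
df$A_clean  <- vapply(
    seq_len(nrow(df)),
    \(i) {
        A1  <- check_list[["A1"]][[i]]
        q1  <- check_list[["q1"]][[i]]
        vals_to_keep  <- A1[!A1 %in% q1]  # words of A1 not found in q1
        paste(vals_to_keep, collapse = " ")
    },
    character(1)
)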

df
#   id                A1       q1  A_clean
# 1  1 AB AC AD AE AF AG AB AC AE AD AF AG
# 2  2    AB AD AH AI AJ AD AJ AI    AB AH

CodePudding user response：

In base R, with setdiff:

mapply(\(x, y) paste(setdiff(x, y), collapse = " "), 
       strsplit(df$A1, " "), strsplit(df$q1, " "))
#[1] "AD AF AG" "AB AH"
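
One difference between this and the %in% approach is worth knowing: setdiff() returns unique values, so a word that occurs more than once within a single A1 string is collapsed to one occurrence, while filtering with %in% keeps the repeats. A minimal sketch, using made-up strings that are not part of the question's data:

```r
# Hypothetical input: "AD" appears twice in the string
a <- strsplit("AB AD AC AD", " ")[[1]]
q <- strsplit("AB AC", " ")[[1]]

keep_dups <- paste(a[!a %in% q], collapse = " ")   # repeats are kept
drop_dups <- paste(setdiff(a, q), collapse = " ")  # unique values only

keep_dups
# [1] "AD AD"
drop_dups
# [1] "AD"
```

For the example data in the question both give the same result, since no word repeats within a row.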

CodePudding user response：

Using just gsub() (not removing duplicates in A1):

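# Note: this overwrites A1 in place. USE.NAMES = FALSE stops mapply() from
# naming the result after the original A1 strings.
df$A1 = mapply(\(x, y) gsub(y, "", x), df$A1, gsub(" ", " ?| ?", df$q1),
               USE.NAMES = FALSE)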


df$A1
# [1] "AD AF AG" "AB AH"
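
One caveat with a plain gsub() pattern is that it matches substrings, so a q1 word such as "AB" would also cut into a longer word like "CAB". A possible variant, sketched here with word boundaries, is below. The helper name remove_words and the third test row are made up for illustration, and the sketch assumes the words contain no regex metacharacters and that q1 is never empty:

```r
# Hypothetical helper, not taken from the answers above
remove_words <- function(text, words) {
    # One alternation per row, e.g. "\\b(AB|AC|AE)\\b"
    pattern <- paste0("\\b(", gsub(" +", "|", trimws(words)), ")\\b")
    out <- mapply(\(x, p) gsub(p, "", x, perl = TRUE), text, pattern,
                  USE.NAMES = FALSE)
    # Collapse leftover runs of spaces and trim both ends
    trimws(gsub(" +", " ", out))
}

A1 <- c("AB AC AD AE AF AG", "AB AD AH AI AJ", "CAB AB")
q1 <- c("AB AC AE", "AD AJ AI", "AB")
remove_words(A1, q1)
# [1] "AD AF AG" "AB AH"    "CAB"
```

Like the gsub() line above, this keeps any duplicate words in A1 that are not listed in q1.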

Data

A1 <- c("AB AC AD AE AF AG","AB AD AH AI AJ")
q1 <- c("AB AC AE","AD AJ AI")
id <- 1:2
df <- data.frame(id,A1,q1)