Given an R dataframe with two columns with strings of words in each, remove words repeating in betwe

Time:10-11

I have an R data frame like the example below (but with tens of thousands of rows).

A1 <- c("AB AC AD AE AF AG","AB AD AH AI AJ")
q1 <- c("AB AC AE","AD AJ AI")
id <- 1:2
df <- data.frame(id,A1,q1)

I would like the result to look like this:

df$A_clean <- c("AD AF AG", "AB AH")

I have tried using "str_split" together with the "exclude" function from the qdap package, but this seems to work on the whole column at once rather than row by row. It gives me the unique words in each row of A1 after excluding every word that appears anywhere in the q1 column, not just the words in the same row.

CodePudding user response:

Here is a base R method:


check_list  <- list(
    A1 = strsplit(df$A1, "\\s"),
    q1 = strsplit(df$q1, "\\s")
)

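# vapply() returns a character vector; lapply() would create a list column
df$A_clean  <- vapply(
    seq_len(nrow(df)),
    \(i) {
        A1  <- check_list[["A1"]][[i]]
        q1  <- check_list[["q1"]][[i]]
        # keep only the words of this row's A1 that are not in this row's q1
        vals_to_keep  <- A1[!A1 %in% q1]
        paste(vals_to_keep, collapse = " ")
    },
    character(1)
)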

df
#   id                A1       q1  A_clean
# 1  1 AB AC AD AE AF AG AB AC AE AD AF AG
# 2  2    AB AD AH AI AJ AD AJ AI    AB AH

CodePudding user response:

In base R, with setdiff:

mapply(\(x, y) paste(setdiff(x, y), collapse = " "), 
       strsplit(df$A1, " "), strsplit(df$q1, " "))
#[1] "AD AF AG" "AB AH" 
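
One difference from the `%in%` approach may matter on real data: setdiff() returns unique values, so a word that occurs twice in A1 comes back only once, while filtering with `%in%` keeps every occurrence. A small sketch of the difference, using a made-up input with a repeated word (not taken from the question's data):

```r
# Hypothetical input: "AB" appears twice in a, and "AC" is the word to remove
a <- strsplit("AB AB AC AD", " ")[[1]]
q <- strsplit("AC", " ")[[1]]

# %in% filtering keeps repeated words
keep_dups <- paste(a[!a %in% q], collapse = " ")

# setdiff() also de-duplicates what is left
drop_dups <- paste(setdiff(a, q), collapse = " ")

keep_dups
# [1] "AB AB AD"
drop_dups
# [1] "AB AD"
```

Which one is right depends on whether repeated words in A1 should survive the cleaning.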

CodePudding user response:

Using just gsub() (this overwrites A1 in place and does not remove duplicate words in A1):

df$A1 = mapply(\(x, y) gsub(y, "", x), df$A1, gsub(" ", " ?| ?", df$q1))


df$A1
# [1] "AD AF AG" "AB AH" 
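
Note that this pattern has no word boundaries, so it can also remove a match inside a longer word: with q1 = "AB", the string "AB ABC AD" becomes " C AD". That cannot happen with the example data, where every token is two letters, but it may on real data. Below is a rough sketch of a whole-word variant. It assumes the words contain only letters and digits (no regex metacharacters); the data and the helper name `remove_words` are made up for illustration.

```r
# Hypothetical data: "ABC" and "XAD" contain a q1 word but are different words
A1 <- c("AB ABC AD", "AD XAD AH")
q1 <- c("AB", "AD")

# Turn "w1 w2" into "\\b(w1|w2)\\b" so that only whole words match
patterns <- paste0("\\b(", gsub(" ", "|", q1, fixed = TRUE), ")\\b")

# Illustrative helper: drop the matched words, then tidy the leftover spaces
remove_words <- function(x, pattern) {
  out <- gsub(pattern, "", x, perl = TRUE)
  trimws(gsub("\\s+", " ", out))
}

A_clean <- mapply(remove_words, A1, patterns, USE.NAMES = FALSE)
A_clean
# [1] "ABC AD" "XAD AH"
```

Like the gsub() version above, this keeps duplicate words in A1 that are not listed in q1.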

Data

A1 <- c("AB AC AD AE AF AG","AB AD AH AI AJ")
q1 <- c("AB AC AE","AD AJ AI")
id <- 1:2
df <- data.frame(id,A1,q1)