split string and keep only unique values in the dataframe-CodePudding

I have a data frame with multiple terms in a single column. This type of repeating terms are within the whole data frame.

I want to remove the repeating words and only keep unique values.

for a single column, i have tried this:

data$`H021-FYR8SS` <- sapply(data$`H021-FYR8SS`, function(x) paste(unique(unlist(str_split(x,", "))), collapse = "; "))

which works perfectly fine:

I want to apply this thing in the whole data frame. How do i do this?

Thanks

               H021-2YTE1K H021-4YZREF     H021-FYR8SS                        
RP1                    " "         "SNV, SNV, SNV" " "                                
NCOA2                  " "         "SNV"           " "                                
RGS22                  " "         "SNV"           " "                                
CSMD3                  " "         "SNV, SNV"      " "                                
TNFRSF11B              " "         "SNV"           " "
SEMA3A                 " "         " "             "insertion"                        
ENPP2                  " "         " "             "deletion"                         
IGLV3-1                " "         " "             "deletion, SNV, SNV, SNV"          
MTMR11                 "SNV"       " "             " "

So, my output should look like this

               H021-2YTE1K H021-4YZREF     H021-FYR8SS                        
RP1                    " "         "SNV"           " "                                
NCOA2                  " "         "SNV"           " "                                
RGS22                  " "         "SNV"           " "                                
CSMD3                  " "         "SNV"           " "                                
TNFRSF11B              " "         "SNV"           " "
SEMA3A                 " "         " "             "insertion"                        
ENPP2                  " "         " "             "deletion"                         
IGLV3-1                " "         " "             "deletion; SNV"          
MTMR11                 "SNV"       " "             " "

CodePudding user response：

Using base:

sapply(dat,
      \(y) sapply(y, \(x) paste(unique(unlist(strsplit(x,", "))), collapse = "; ")))

Using dplyr:

library(dplyr)

dat |>
 rownames_to_column() |> 
 group_by(rowname) |>
 mutate(across(everything(), ~ paste(unique(unlist(str_split(.,", "))), collapse = "; "))) |>
 column_to_rownames() |>
 ungroup()

Output:

       H021-2YTE1K   H021-4YZREF H021-FYR8SS
BRAF      deletion                       SNV
CC2D2A   insertion                       SNV
MRPL19             deletion; SNV            
SRRM1                   deletion            
ZFP69B                 insertion            
ZNF711   insertion

Data:

dat <-
structure(list(`H021-2YTE1K` = c("insertion", "deletion", "insertion", " ", " ", " "), `H021-4YZREF` = c(" ", " ", " ", "deletion", "insertion", "deletion, SNV, SNV, SNV"), `H021-FYR8SS` = c("SNV", "SNV, SNV", " ", " ", " ", " ")), row.names = c("CC2D2A", "BRAF", "ZNF711", "SRRM1", "ZFP69B", "MRPL19"), class = "data.frame")

CodePudding user response：

Here's a base R version with some explanations.

You need to apply the same operations to a number of columns. So it is useful first to create a function that does it on a single column:

fun <- function(x){
  xs <- sapply(strsplit(x, ","), trimws)  # fist `strsplit` by comma, then trim whitespace 
                                          # [so that the number of spaces before / after commas won't matter]
  xs <- sapply(xs, unique)  # keep unique elements only
  sapply(xs, paste, collapse="; ")  # output: paste elements, separated by ";"
}

Try it out:

fun(dat[,2])
# [1] ""              ""              ""              "deletion"      "insertion"     "deletion; SNV"

You can write this function as a one-liner, which is shorter but obviously more tedious to follow:

funGolfed <- function(x) sapply(strsplit(x, ","), function(x) paste(unique(trimws(x)), collapse="; "))
# a "golfed" version, as in code golf, https://codegolf.stackexchange.com/

Or a version with pipes (possibly easier to follow):

library(magrittr)
# I am using a slightly outdated version of R - you can replace 
# the magrittr pipe "%>%" by "|>"
funPiped <- function(x) x %>% strsplit(",") %>% sapply(trimws) %>% sapply(unique) %>% sapply(paste, collapse="; ")

To apply it to all columns at once:

sapply(dat, fun)
# or funGolfed, or funPiped

And to put it back to the original data frame:

dat[,] <- lapply(dat, fun)
dat

Output

> dat
  H021-2YTE1K   H021-4YZREF H021-FYR8SS
1   insertion                       SNV
2    deletion                       SNV
3   insertion                          
4                  deletion            
5                 insertion            
6             deletion; SNV

Data

dat <- structure(list(`H021-2YTE1K` = c("insertion", "deletion", "insertion", " ", " ", " "), 
                      `H021-4YZREF` = c(" ", " ", " ", "deletion", "insertion", "deletion, SNV, SNV, SNV"), 
                      `H021-FYR8SS` = c("SNV",  "SNV, SNV", " ", " ", " ", " ")), 
                 row.names = c(NA, -6L), class = "data.frame")