Using regex to drop duplicated elements in columns of an R dataframe

I have a dummy dataframe df with 6 rows and 4 columns.

df <- data.frame(
 Hits = c("Hit1", "Hit2", "Hit3", "Hit4", "Hit5", "Hit6"),
 GO = c("GO:0005634~nucleus,", "", "GO:0005737~cytoplasm,", "GO:0005634~nucleus,GO:0005737~cytoplasm,", "", 
            "GO:0005634~nucleus,GO:0005654~nucleoplasm,"),
 KEGG = c("", "", "", "", "", ""),
 SMART = c("SM00394:RIIa,", "SM00394:RIIa,", "", "SM00054:EFh,", 
            "", "SM00394:RIIa,SM00239:C2,"))

df looks like this

The elements in the columns consist of two parts:

an identifier (e.g. GO:0005634~, SM00394: etc.)
a term (e.g. nucleus, EFh etc.)

For each column, I want to retain a row if it contains at least one term that is not present in any row above it. For example, in the column GO, rows 1 and 3 each contain a term not seen before, so they should be retained. Row 4 contains only terms that are already present in rows 1 and 3, so it should be dropped. Row 6 has one term (nucleoplasm) that is not present in any row above it, so it should also be retained.

I have been able to come up with regular expressions to extract the terms from the columns GO and SMART

Regex for GO: (?<=~).*?(?=,(?:GO:\\d+~|$))
Regex for SMART: (?<=:).*?(?=,(?:\\w+\\d+:|$))
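
As a quick check, the two patterns can be tried in base R. This is a minimal sketch, assuming the quantifiers are `\\d+` and `\\w+` and that the patterns are run with perl = TRUE, which base R needs for lookbehind. The helper name extract_terms is hypothetical.

```r
# Hypothetical helper: return every match of a PCRE pattern in one string.
extract_terms <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

go_rx    <- "(?<=~).*?(?=,(?:GO:\\d+~|$))"
smart_rx <- "(?<=:).*?(?=,(?:\\w+\\d+:|$))"

# Entries taken from rows 4 and 6 of df
go_terms    <- extract_terms("GO:0005634~nucleus,GO:0005737~cytoplasm,", go_rx)
smart_terms <- extract_terms("SM00394:RIIa,SM00239:C2,", smart_rx)

print(go_terms)
print(smart_terms)
```

Each call returns a character vector with one term per item, which is the form needed to compare terms across rows.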

But I'm unable to figure out a way to integrate the regex and the conditions mentioned above into a solution. The output should look like this

Any suggestions on how to solve this?

CodePudding user response：

Here is a general approach that handles GO and SMART. It may also work for KEGG, but that is impossible to say without any information about the format of KEGG entries.

The function f below takes as arguments

x, a character vector
split, the delimiter separating items in lists
sep, the delimiter separating identifiers and terms within items

and returns a logical vector indexing the elements of x with at least one non-duplicated term.

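f <- function(x, split, sep) {
    # Split each element into items; "" yields character(0)
    l1 <- strsplit(x, split)
    # Strip the identifier (everything up to and including the first sep)
    tt <- sub(paste0("^[^", sep, "]*", sep), "", unlist(l1))
    # Flag terms already seen earlier, then restore the per-row structure
    l2 <- relist(duplicated(tt), l1)
    # Keep a row only if not all of its terms are duplicates
    # (empty rows give all(logical(0)) == TRUE and are therefore dropped)
    !vapply(l2, all, NA)
}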
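
To make the steps inside f concrete, here is a small trace on a toy vector. The input x_demo and its identifiers are made up for illustration, and the separator is hard-coded as "~". Only base R is used.

```r
# Made-up input: five rows, identifiers ID1..ID3, terms a, b, c
x_demo <- c("ID1~a,", "", "ID2~b,", "ID1~a,ID2~b,", "ID1~a,ID3~c,")

l1   <- strsplit(x_demo, ",")            # one character vector per row
tt   <- sub("^[^~]*~", "", unlist(l1))   # terms only, in row order
dup  <- duplicated(tt)                   # TRUE where a term was seen before
l2   <- relist(dup, l1)                  # back to one logical vector per row
keep <- !vapply(l2, all, NA)             # rows with at least one new term

print(tt)
print(keep)
```

Row 5 is kept because c is new even though a is not, while row 4 is dropped because both of its terms appeared earlier.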

Applying f to GO and SMART:

nms <- c("GO", "SMART")
l <- Map(f, x = df[nms], split = ",", sep = c("~", ":"))
l
## $GO
## [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE
## 
## $SMART
## [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE

Setting the elements of GO and SMART that have no non-duplicated terms to "", then dropping the rows in which neither GO nor SMART introduces a new term, gives the desired result:

df2 <- df
df2[nms] <- Map(replace, df2[nms], lapply(l, `!`), "")
df2[Reduce(`|`, l), ]
##   Hits                                         GO KEGG                    SMART
## 1 Hit1                        GO:0005634~nucleus,                 SM00394:RIIa,
## 3 Hit3                      GO:0005737~cytoplasm,                              
## 4 Hit4                                                             SM00054:EFh,
## 6 Hit6 GO:0005634~nucleus,GO:0005654~nucleoplasm,      SM00394:RIIa,SM00239:C2,
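
The two idioms in the last step may be easier to read on a toy example. In this sketch the list flags and the vector vals are invented for illustration: Reduce with `|` combines the per-column logical vectors element-wise, and replace blanks the positions where the flag is FALSE.

```r
# Invented flags for two columns A and B over four rows
flags <- list(A = c(TRUE, FALSE, TRUE, FALSE),
              B = c(TRUE, FALSE, FALSE, TRUE))
vals <- c("w", "x", "y", "z")

any_new <- Reduce(`|`, flags)            # TRUE where any column has a new term
blanked <- replace(vals, !flags$B, "")   # blank entries flagged FALSE in B

print(any_new)
print(blanked)
```

Only row 2 is FALSE in both columns, so it is the only row that would be filtered out.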

CodePudding user response：

The following algorithm is applied to each annotation column (here GO and SMART; KEGG is empty in the example):

split each entry into its comma-separated identifier-term items (see stringr::str_split)
extract the term from each item with a regex
accumulate the union of all terms seen so far, row by row
take the difference between the accumulated set at each row and the accumulated set at the row immediately preceding it
replace the entry with "" if that difference is empty, i.e. no new term is introduced
keep only the rows where not all of the columns are ""
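
The accumulate-and-compare steps above can also be sketched in base R. This is an illustration under assumptions rather than the answer's code: the list terms_demo holds already-extracted GO terms for the six rows of df, typed in by hand, and the variable names are hypothetical.

```r
# Hand-typed GO terms for the six rows of df (empty rows are character(0))
terms_demo <- list("nucleus", character(0), "cytoplasm",
                   c("nucleus", "cytoplasm"), character(0),
                   c("nucleus", "nucleoplasm"))

# Terms seen up to and including each row
seen <- Reduce(union, terms_demo, accumulate = TRUE, simplify = FALSE)
# Terms seen strictly before each row (nothing before row 1)
seen_before <- c(list(character(0)), seen[-length(seen)])
# Terms that first appear in each row
new_terms <- Map(setdiff, seen, seen_before)
has_new <- lengths(new_terms) > 0

print(has_new)
```

A row whose has_new value is FALSE introduces no new term, so its entry would be replaced with "".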

library(dplyr)
library(stringr)
library(purrr)

termred <- function(terms, rx) {
  terms |>
    stringr::str_split(",") |>                    # split into items
    purrr::map(stringr::str_trim) |>
    purrr::map(~{.x[.x != ""]}) |>                # drop empty items
    purrr::map(~stringr::str_extract(.x, rx)) |>  # keep the term part
    purrr::accumulate(union) %>%                  # terms seen up to each row
    # new terms per row; SIMPLIFY = FALSE guarantees a list result
    {mapply(setdiff, ., dplyr::lag(., 1), SIMPLIFY = FALSE)} %>%
    {ifelse(lengths(.) > 0, terms, "")}
}

df |>
  transform(GO = termred(GO, "~.*$")) |>
  transform(SMART = termred(SMART, ":.*$")) |>
  filter(GO != "" | SMART != "" | KEGG != "")
##>  Hits                                         GO KEGG                    SMART
##>1 Hit1                        GO:0005634~nucleus,                 SM00394:RIIa,
##>2 Hit3                      GO:0005737~cytoplasm,                              
##>3 Hit4                                                             SM00054:EFh,
##>4 Hit6 GO:0005634~nucleus,GO:0005654~nucleoplasm,      SM00394:RIIa,SM00239:C2,