How to find all rownames of the same length that differ by one character and merge them in R

I am fairly new to R. I have duplicate rows that differ only by either a "." or a "-", similar to this example.

df <- data.frame(cell1 = c(0,1,2,3,4,5,6,7), cell2 = c(0,1,2,3,4,5,6,7))
rownames(df) <- c("HLA.F", "HLA.G", "HLA.A", "HLA-F", "HLA-G", "HLA-A", "HLA-F-AS1", "HLA-E")

          cell1 cell2
HLA.F         0     0
HLA.G         1     1
HLA.A         2     2
HLA-F         3     3
HLA-G         4     4
HLA-A         5     5
HLA-F-AS1     6     6
HLA-E         7     7

I would like to merge the duplicate rows that differ only by a "." or a "-", summing their values and keeping the "." version of the name, like this.

          cell1 cell2
HLA.F         3     3
HLA.G         5     5
HLA.A         7     7
HLA-F-AS1     6     6
HLA-E         7     7

Here is what I have so far. It more or less finds and replaces the duplicates, but I don't know how to handle HLA-F/HLA.F/HLA-F-AS1 (I only want to match rownames of the same length, so HLA-F-AS1 should remain unchanged) or HLA-E, which has no duplicate.

for (i in unique(rownames(df)))
{
  test <- grep(i, unique(rownames(df)))
  df.2 <- t(df[test,])
  #get duplicated names
  df.2.cols <- colnames(df.2)
  #get name with period
  test.period <- grep("[.]",df.2.cols, value = TRUE)
  #merge them
  df.2 <- apply(df.2[,df.2.cols], 1, function(x) x[!is.na(x)][1])
  df.2 <- t(as.data.frame(df.2))
  rownames(df.2) <- test.period
  #remove duplicates and replace it with merged row
  df <- df[!(row.names(df) %in% df.2.cols), ]
  df <- rbind(df, df.2)
}

Hopefully this makes sense and thanks for reading!

CodePudding user response:

Try this

library(dplyr)

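# Normalise "-" to "." so that names such as HLA-F and HLA.F compare equal
df$n <- gsub("-", ".", rownames(df), fixed = TRUE)
# Normalised names that occur more than once are the duplicates to merge
chs <- names(which(table(df$n) > 1))
# Use the "." version for duplicates; keep the original name otherwise
df$rnm <- ifelse(df$n %in% chs, df$n, rownames(df))

df |> group_by(rnm) |> summarise(cell1 = sum(cell1), cell2 = sum(cell2))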

Output

# A tibble: 5 × 3
  rnm       cell1 cell2
  <chr>     <dbl> <dbl>
1 HLA-E         7     7
2 HLA-F-AS1     6     6
3 HLA.A         7     7
4 HLA.F         3     3
5 HLA.G         5     5
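
To see why HLA-F-AS1 and HLA-E keep their names, it can help to look at the renaming step on its own. This is only a minimal base R sketch, assuming the intended normalisation is to replace every "-" with "."; the names nms, key, dup and out are made up for illustration.

```r
# Row names from the question
nms <- c("HLA.F", "HLA.G", "HLA.A", "HLA-F", "HLA-G", "HLA-A", "HLA-F-AS1", "HLA-E")

# Replace every "-" with "." (fixed = TRUE treats "-" as a literal string)
key <- gsub("-", ".", nms, fixed = TRUE)

# Normalised names that occur more than once
dup <- names(which(table(key) > 1))

# Keep the "." version for duplicates, the original name otherwise
out <- ifelse(key %in% dup, key, nms)
out
```

Because the replacement swaps one character for one character, two names can only share a key if they have the same length, so HLA-F-AS1 never collides with HLA-F. HLA-E has a unique key and is left as it is.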

CodePudding user response:

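We could convert the rownames to a column, then create a grouping column with the first punctuation character removed (str_remove only drops the first match, so HLA-F-AS1 stays in its own group), and use that to determine which rownames are duplicated and change "-" to "." in those. Then we can summarise and put the names back into the rownames.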

library(tidyverse)

df %>%
  rownames_to_column('id') %>%
  group_by(grp = str_remove(id, "[[:punct:]]")) %>%
  mutate(id = ifelse(n() > 1, str_replace_all(id, '-', '.'), id)) %>%
  group_by(id) %>%
  summarise(cell1 = sum(cell1, na.rm = TRUE),
            cell2 = sum(cell2, na.rm = TRUE)) %>%
  column_to_rownames("id")

Output

          cell1 cell2
HLA-E         7     7
HLA-F-AS1     6     6
HLA.A         7     7
HLA.F         3     3
HLA.G         5     5
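
If you would rather avoid extra packages, the same idea can probably be done in base R as well. This is a rough sketch, not a tested drop-in: it assumes all columns are numeric and that merging means summing, as in the expected output. The names key, is_dup, new_names and res are made up for illustration.

```r
# Example data from the question
df <- data.frame(cell1 = c(0,1,2,3,4,5,6,7), cell2 = c(0,1,2,3,4,5,6,7))
rownames(df) <- c("HLA.F", "HLA.G", "HLA.A", "HLA-F", "HLA-G", "HLA-A", "HLA-F-AS1", "HLA-E")

# Normalise "-" to "." so that HLA-F and HLA.F get the same key
key <- gsub("-", ".", rownames(df), fixed = TRUE)

# Checking duplicated() from both ends flags every member of a duplicated group
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Duplicates take the "." name; everything else keeps its original name
new_names <- ifelse(is_dup, key, rownames(df))

# rowsum() adds up the rows that share a group label;
# reorder = FALSE keeps the groups in order of first appearance
res <- rowsum(df, group = new_names, reorder = FALSE)
res
```

With reorder = FALSE the rows come out in the order they first appear, which matches the layout asked for in the question, and the result keeps the names as rownames without a conversion step.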