I am fairly new to R. I have duplicate rows that differ only by either a "." or a "-", similar to this example.
df <- data.frame(cell1 = c(0,1,2,3,4,5,6,7), cell2 = c(0,1,2,3,4,5,6,7))
rownames(df) <- c("HLA.F", "HLA.G", "HLA.A", "HLA-F", "HLA-G", "HLA-A", "HLA-F-AS1", "HLA-E")
cell1 cell2
HLA.F 0 0
HLA.G 1 1
HLA.A 2 2
HLA-F 3 3
HLA-G 4 4
HLA-A 5 5
HLA-F-AS1 6 6
HLA-E 7 7
I would like to essentially merge the duplicate rows only differing by either a "." or a "-" and replace it with the "." version like this.
cell1 cell2
HLA.F 3 3
HLA.G 5 5
HLA.A 7 7
HLA-F-AS1 6 6
HLA-E 7 7
Here is what I have so far. It sort of finds and replaces the duplicates, but I don't know what to do with HLA-F/HLA.F/HLA-F-AS1 (I only want the rownames that are the same length, so HLA-F-AS1 would remain unchanged) or HLA-E, which has no duplicates.
for (i in unique(rownames(df)))
{
test <- grep(i, unique(rownames(df)))
df.2 <- t(df[test,])
#get duplicated names
df.2.cols <- colnames(df.2)
#get name with period
test.period <- grep("[.]",df.2.cols, value = TRUE)
#merge them
df.2 <- apply(df.2[,df.2.cols], 1, function(x) x[!is.na(x)][1])
df.2 <- t(as.data.frame(df.2))
rownames(df.2) <- test.period
#remove duplicates and replace it with merged row
df <- df[!(row.names(df) %in% df.2.cols), ]
df <- rbind(df, df.2)
}
Hopefully this makes sense and thanks for reading!
CodePudding user response:
Try this
library(dplyr)
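# Normalise "-" to "." so that variants such as HLA-F and HLA.F share one key
df$n <- gsub("-", ".", rownames(df), fixed = TRUE)
# Keys that occur more than once are the duplicated names
chs <- names(which(table(df$n) > 1))
# Use the "." version for duplicated names; keep every other row name unchanged
df$rnm <- ifelse(df$n %in% chs, df$n, rownames(df))
# Assign the summary back to df so that printing df shows the merged rows
df <- df |> group_by(rnm) |> summarise(cell1 = sum(cell1), cell2 = sum(cell2))
df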
Output
# A tibble: 5 × 3
rnm cell1 cell2
<chr> <dbl> <dbl>
1 HLA-E 7 7
2 HLA-F-AS1 6 6
3 HLA.A 7 7
4 HLA.F 3 3
5 HLA.G 5 5
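If the merged names are needed as row names rather than as a tibble column, the same idea can be sketched in base R with rowsum(), which sums the rows that share a group label and uses the labels as row names. This is only a minimal sketch under the assumption that df is the original data frame from the question; the names key, dup, grp and merged are made up for illustration.

```r
# Data from the question
df <- data.frame(cell1 = c(0,1,2,3,4,5,6,7), cell2 = c(0,1,2,3,4,5,6,7))
rownames(df) <- c("HLA.F", "HLA.G", "HLA.A", "HLA-F", "HLA-G", "HLA-A", "HLA-F-AS1", "HLA-E")

# Candidate key: every "-" replaced by "."
key <- gsub("-", ".", rownames(df), fixed = TRUE)
# A key counts as duplicated only if more than one row maps to it
dup <- key %in% names(which(table(key) > 1))
# Use the "." name for duplicated rows, keep all other names unchanged
grp <- ifelse(dup, key, rownames(df))

# rowsum() adds up rows with the same label;
# reorder = FALSE keeps the groups in order of first appearance
merged <- as.data.frame(rowsum(df, group = grp, reorder = FALSE))
merged
```

Because HLA-F-AS1 and HLA-E each map to a key that occurs only once, they should keep their original names, and the rows should come out in the same order as the desired output in the question.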
CodePudding user response:
We could convert the rownames to a column, then create a grouping column without punctuation, and use that to determine which rownames were duplicated and change - to a . for those. Then, we can summarise and put the result back into rownames.
library(tidyverse)
df %>%
rownames_to_column('id') %>%
group_by(grp = str_remove(id, "[[:punct:]]")) %>%
mutate(id = ifelse(n() > 1, str_replace_all(id, '-', '.'), id)) %>%
group_by(id) %>%
summarise(cell1 = sum(cell1, na.rm = TRUE),
cell2 = sum(cell2, na.rm = TRUE)) %>%
column_to_rownames("id")
Output
cell1 cell2
HLA-E 7 7
HLA-F-AS1 6 6
HLA.A 7 7
HLA.F 3 3
HLA.G 5 5
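To see why the grouping step works, it can help to look at the key on its own. str_remove() drops only the first punctuation character, so HLA-F and HLA.F both become HLAF, while HLA-F-AS1 becomes HLAF-AS1 and stays in a group of its own. Here is a small base R sketch of that step, using sub() as a stand-in for str_remove(); the names ids, grp and sizes are illustrative only.

```r
ids <- c("HLA.F", "HLA.G", "HLA.A", "HLA-F", "HLA-G", "HLA-A", "HLA-F-AS1", "HLA-E")

# sub() replaces only the first match, like stringr::str_remove()
grp <- sub("[[:punct:]]", "", ids)

# Number of row names sharing each key
sizes <- table(grp)

# Keys shared by more than one row name are the ones that get merged
merged_keys <- names(sizes)[sizes > 1]
merged_keys
```

Only the keys for HLA.F, HLA.G and HLA.A are shared by two row names, which is why n() > 1 in the pipeline above selects exactly those rows for renaming.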