Home > Back-end >  How to merge rows with the same prefix in a data frame?
How to merge rows with the same prefix in a data frame?

Time:05-30

Hello I'm trying to figure out how to merge rows (and sum the columns) of a data frame by prefix matching:

Example dataframe:

set.seed(42)  ## for sake of reproducibility
df <- data.frame(col1=c(sprintf("gene%s", 1:3), sprintf("protein%s", 1:5), sprintf("lipid%s", 1:3)), 
                 counts=runif(11, min=10, max=70))
df
#        col1   counts
# 1     gene1 64.88836
# 2     gene2 66.22452
# 3     gene3 27.16837
# 4  protein1 59.82686
# 5  protein2 48.50473
# 6  protein3 41.14576
# 7  protein4 54.19530
# 8  protein5 18.08000
# 9    lipid1 49.41954
# 10   lipid2 52.30389
# 11   lipid3 37.46451

So I want that all the rows that starts with "gene" are merged into a single row and the same with the protein and lipid rows.

Desired output:

    col1   counts
    gene 158.2813
   lipid 139.1879
 protein 221.7526

CodePudding user response:

gsub the numbers away, then aggregate using formula

aggregate(counts ~ gsub('\\d ', '', col1), df, sum)
#   gsub("\\\\d", "", col1)   counts
# 1                    gene 158.2813
# 2                   lipid 139.1879
# 3                 protein 221.7526

or list notation.

with(df, aggregate(list(counts=counts), list(col1=gsub('\\d ', '', col1)), sum))
#      col1   counts
# 1    gene 158.2813
# 2   lipid 139.1879
# 3 protein 221.7526

Sidenote on string generation: You also may use paste0 to suffix with numbers.

paste0("gene", 1:3)
# [1] "gene1" "gene2" "gene3"

Data:

df <- structure(list(col1 = c("gene1", "gene2", "gene3", "protein1", 
"protein2", "protein3", "protein4", "protein5", "lipid1", "lipid2", 
"lipid3"), counts = c(64.8883626097813, 66.2245247978717, 27.1683720871806, 
59.8268575640395, 48.5047311335802, 41.145756947808, 54.195298878476, 
18.0799958342686, 49.4195374241099, 52.303887042217, 37.4645065749064
)), class = "data.frame", row.names = c(NA, -11L))

CodePudding user response:

df %>%
  group_by(col1 = str_remove(col1, "\\d "))%>%
  summarise(counts = sum(counts))
  • Related