Hello I'm trying to figure out how to merge rows (and sum the columns) of a data frame by prefix matching:
Example dataframe:
set.seed(42) ## for sake of reproducibility
df <- data.frame(col1=c(sprintf("gene%s", 1:3), sprintf("protein%s", 1:5), sprintf("lipid%s", 1:3)),
counts=runif(11, min=10, max=70))
df
# col1 counts
# 1 gene1 64.88836
# 2 gene2 66.22452
# 3 gene3 27.16837
# 4 protein1 59.82686
# 5 protein2 48.50473
# 6 protein3 41.14576
# 7 protein4 54.19530
# 8 protein5 18.08000
# 9 lipid1 49.41954
# 10 lipid2 52.30389
# 11 lipid3 37.46451
So I want that all the rows that starts with "gene" are merged into a single row and the same with the protein and lipid rows.
Desired output:
col1 counts
gene 158.2813
lipid 139.1879
protein 221.7526
CodePudding user response:
gsub
the numbers away, then aggregate
using formula
aggregate(counts ~ gsub('\\d ', '', col1), df, sum)
# gsub("\\\\d", "", col1) counts
# 1 gene 158.2813
# 2 lipid 139.1879
# 3 protein 221.7526
or list
notation.
with(df, aggregate(list(counts=counts), list(col1=gsub('\\d ', '', col1)), sum))
# col1 counts
# 1 gene 158.2813
# 2 lipid 139.1879
# 3 protein 221.7526
Sidenote on string generation: You also may use paste0
to suffix with numbers.
paste0("gene", 1:3)
# [1] "gene1" "gene2" "gene3"
Data:
df <- structure(list(col1 = c("gene1", "gene2", "gene3", "protein1",
"protein2", "protein3", "protein4", "protein5", "lipid1", "lipid2",
"lipid3"), counts = c(64.8883626097813, 66.2245247978717, 27.1683720871806,
59.8268575640395, 48.5047311335802, 41.145756947808, 54.195298878476,
18.0799958342686, 49.4195374241099, 52.303887042217, 37.4645065749064
)), class = "data.frame", row.names = c(NA, -11L))
CodePudding user response:
df %>%
group_by(col1 = str_remove(col1, "\\d "))%>%
summarise(counts = sum(counts))