I have a data frame that consists of comma-separated sequences of strings. For example:
df <- data.frame(patterns = c("CCDC127, HSF1, NDUFB9", "CCDC127, EXOC3, YIF1A", "EXOC3, NDUFB9, YIF1A"))
df
patterns
1 CCDC127, HSF1, NDUFB9
2 CCDC127, EXOC3, YIF1A
3 EXOC3, NDUFB9, YIF1A
I have another data frame, where each string corresponds to numerical value. For example:
df2 <- data.frame(strings = c("CCDC127", "HSF1", "NDUFB9", "EXOC3", "YIF1A"),
scores = c(10, 11, 12, 13, 14))
df2
strings scores
1 CCDC127 10
2 HSF1 11
3 NDUFB9 12
4 EXOC3 13
5 YIF1A 14
I would like to calculate a sum of each pattern from the first data frame based on values in the second data frame. For example:
patterns sum
1 CCDC127, HSF1, NDUFB9 33
2 CCDC127, EXOC3, YIF1A 37
3 EXOC3, NDUFB9, YIF1A 39
I would appreciate any directions and help with this question.
Thank you! Olha
CodePudding user response:
You can use strsplit
and sapply
with match
:
df$sum <- sapply(strsplit(df$patterns, ", "),
function(x) sum(df2$scores[match(x, df2$strings)]))
df
#> patterns sum
#> 1 CCDC127, HSF1, NDUFB9 33
#> 2 CCDC127, EXOC3, YIF1A 37
#> 3 EXOC3, NDUFB9, YIF1A 39
CodePudding user response:
This is one where I'm sure there will be a super smart apply solution but I would do it by converting the df
table into a lookup table, then joining it on and summarising.
df %>%
mutate(patterns2 = patterns) %>%
separate(patterns2, paste("c", 1:3)) %>%
pivot_longer(cols = paste("c", 1:3)) %>%
#end of lookup creation, now join on
right_join(df2, by = c("value" = "strings" )) %>%
group_by(patterns) %>%
summarise(scores = sum(scores))
CodePudding user response:
1) Convert df2 into a named list L
suitable for use with eval
and then evaluate each expression formed by parsing pattern
after replacing comma with plus.
L <- with(df2, split(scores, strings))
transform(df, sums = sapply(parse(text = gsub(",", " ", patterns)), eval, L))
giving:
patterns sums
1 CCDC127, HSF1, NDUFB9 33
2 CCDC127, EXOC3, YIF1A 37
3 EXOC3, NDUFB9, YIF1A 39
2) Another approach is to extract the words from patterns, look them up in L from (1) and then sum.
library(gsubfn)
transform(df, sums = sapply(strapply(patterns, "\\w ", x ~ L[[x]]), sum))
CodePudding user response:
We can use tidyr::separate
and rowSums
, all inside a mutate
call:
library(dplyr)
library(tidyr)
df%>%mutate(
sum = df %>% separate(col=patterns, sep=', ', into=paste0('pattern', 1:3))%>%
rowwise()%>%
mutate(across(everything(), ~df2$scores[df2$strings==.x]))%>%
rowSums())
patterns sum
1 CCDC127, HSF1, NDUFB9 33
2 CCDC127, EXOC3, YIF1A 37
3 EXOC3, NDUFB9, YIF1A 39