I have two dataframes, one that looks like this:
>df1
SNP Symbols
1 rs11807834 GRIN1,SETD1A
2 rs3729986 MADD,STAC3,SPI1
3 rs61937595 NDUFA4L2,STAC3,CAMK2N1
and another that looks like this
>df2
Symbol Score
1 GRIN1 167
2 SETD1A 160
3 MADD 164
4 STAC3 12
5 CAMK2N1 3
6 NDUFA4L2 0
7 SPI1 0
I want to get the Symbol with the highest score for each SNP
column. So it would look like this:
>result
SNP Symbols Highest.Score
rs11807834 GRIN1,SETD1A GRIN1
rs2600490 MADD,STAC3,SPI1 MADD
rs3729986 NDUFA4L2,STAC3,CAMK2N1 STAC3
Any suggestions how to achieve this?
df1 <- data.frame("SNP" = c("rs11807834", "rs3729986", "rs61937595" ), "Symbols" = c("GRIN1,SETD1A", "MADD,STAC3,SPI1", "NDUFA4L2,STAC3,CAMK2N1"))
df2 <- data.frame("Symbol" = c("GRIN1", "SETD1A", "MADD", "STAC3", "CAMK2N1", "NDUFA4L2", "SPI1"), "Score" = c(167, 160, 164,12,3,0,0))
CodePudding user response:
We could combine separate_rows
with a left_join
and slice_max
after grouping:
library(dplyr)
library(tidyr)
df1 %>%
separate_rows(Symbols, sep = ",") %>%
left_join(df2, by=c("Symbols" = "Symbol")) %>%
group_by(SNP) %>%
slice_max(Score)
SNP Symbols Score
<chr> <chr> <int>
1 rs11807834 GRIN1 167
2 rs3729986 MADD 164
3 rs61937595 STAC3 12
CodePudding user response:
this is a way to do it using dplyr and grepl:
df1 <- data.frame("SNP" = c("rs11807834", "rs3729986", "rs61937595" ), "Symbols" = c("GRIN1,SETD1A", "MADD,STAC3,SPI1", "NDUFA4L2,STAC3,CAMK2N1"))
df2 <- data.frame("Symbol" = c("GRIN1", "SETD1A", "MADD", "STAC3", "CAMK2N1", "NDUFA4L2", "SPI1"),
"Score" = c(167,160,164,12,3,0,0))
library(dplyr)
result= df1%>%
rowwise()%>%
mutate(Highest.Score=df2[max(df2[grepl(paste(unlist(strsplit(Symbols,split = ",")),collapse = "|"),df2$Symbol),]$Score)==df2$Score,]$Symbol)
# > result
# # A tibble: 3 x 3
# # Rowwise:
# SNP Symbols Highest.Score
# <chr> <chr> <chr>
# 1 rs11807834 GRIN1,SETD1A GRIN1
# 2 rs3729986 MADD,STAC3,SPI1 MADD
# 3 rs61937595 NDUFA4L2,STAC3,CAMK2N1 STAC3
CodePudding user response:
I fixed df2. I am sure my method is not the most "programatic" one and that there is a nicer way of doing this with pivot, join and mutate functions but this code works:
What the loop does, is for each row of Highest score (which was created just at the line above the loop) it filters for the symbols in that line according to df1$Symbol
. The unlist(str_split())
is to split the symbol codes into seperate string vectors (by detecting the "," pattern). then retrieving the first row only (arranged by default from the highest, otherwise can add arrange(-score)
argument to explicitly arrange it)
df1 <- data.frame(SNP = c("rs11807834", "rs3729986", "rs61937595" ),
"Symbols" = c("GRIN1,SETD1A", "MADD,STAC3,SPI1",
"NDUFA4L2,STAC3,CAMK2N1"))
# This code was corrected to match the example
df2 <- data.frame("Symbol" = c("GRIN1", "SETD1A", "MADD", "STAC3",
"CAMK2N1", "NDUFA4L2", "SPI1"),
Score=c(167,160,164,12,3,0,0))
df3 <- df1
df3$Highest.Score <- NA
for(i in 1:nrow(df1)){
df3$Highest.Score[i] <- df2 %>% filter(Symbol %in%
unlist(str_split(df1$Symbols[i],","))) %>%
slice_head(n=1) %>% pull(Symbol)
}
print(df3)
Output:
> print(df3)
SNP Symbols Highest.Score
1 rs11807834 GRIN1,SETD1A GRIN1
2 rs3729986 MADD,STAC3,SPI1 MADD
3 rs61937595 NDUFA4L2,STAC3,CAMK2N1 STAC3
CodePudding user response:
Another option with str_split
and left_join
inside a user defined function:
maxScore <- function(x){
data.frame(Symbol = str_split(x, ",") %>%
unlist()
) %>%
left_join(df2, by = c("Symbol")) %>%
select(Score) %>%
max()
}
df1 %>% rowwise() %>% mutate(MS = maxScore(Symbols))