Home > Mobile >  Compare two dataframes and retrieve values
Compare two dataframes and retrieve values

Time:11-17

I have two dataframes, one that looks like this:

>df1
             SNP                Symbols
    1 rs11807834          GRIN1,SETD1A
    2  rs3729986      MADD,STAC3,SPI1
    3 rs61937595 NDUFA4L2,STAC3,CAMK2N1

and another that looks like this

>df2

   Symbol    Score
1   GRIN1      167
2  SETD1A      160
3    MADD      164
4   STAC3       12
5 CAMK2N1        3
6 NDUFA4L2       0
7   SPI1         0

I want to get the Symbol with the highest score for each SNP column. So it would look like this:

>result
    
          SNP                 Symbols        Highest.Score   
   rs11807834            GRIN1,SETD1A                GRIN1                                
    rs2600490          MADD,STAC3,SPI1                MADD                          
    rs3729986   NDUFA4L2,STAC3,CAMK2N1               STAC3                       

Any suggestions how to achieve this?

df1 <- data.frame("SNP" = c("rs11807834", "rs3729986", "rs61937595" ), "Symbols" = c("GRIN1,SETD1A", "MADD,STAC3,SPI1", "NDUFA4L2,STAC3,CAMK2N1"))

df2 <- data.frame("Symbol" = c("GRIN1", "SETD1A", "MADD", "STAC3", "CAMK2N1", "NDUFA4L2", "SPI1"), "Score" = c(167, 160, 164,12,3,0,0))

CodePudding user response:

We could combine separate_rows with a left_join and slice_max after grouping:

library(dplyr)
library(tidyr)

df1 %>% 
  separate_rows(Symbols, sep = ",") %>% 
  left_join(df2, by=c("Symbols" = "Symbol")) %>% 
  group_by(SNP) %>% 
  slice_max(Score)
  SNP        Symbols Score
  <chr>      <chr>   <int>
1 rs11807834 GRIN1     167
2 rs3729986  MADD      164
3 rs61937595 STAC3      12

CodePudding user response:

this is a way to do it using dplyr and grepl:

df1 <- data.frame("SNP" = c("rs11807834", "rs3729986", "rs61937595" ), "Symbols" = c("GRIN1,SETD1A", "MADD,STAC3,SPI1", "NDUFA4L2,STAC3,CAMK2N1"))


df2 <- data.frame("Symbol" = c("GRIN1", "SETD1A", "MADD", "STAC3", "CAMK2N1", "NDUFA4L2", "SPI1"),
                  "Score" = c(167,160,164,12,3,0,0))

library(dplyr)
result= df1%>%
rowwise()%>%
  mutate(Highest.Score=df2[max(df2[grepl(paste(unlist(strsplit(Symbols,split = ",")),collapse = "|"),df2$Symbol),]$Score)==df2$Score,]$Symbol)

# > result
# # A tibble: 3 x 3
# # Rowwise: 
# SNP        Symbols                Highest.Score
# <chr>      <chr>                  <chr>        
#    1 rs11807834 GRIN1,SETD1A           GRIN1        
# 2 rs3729986  MADD,STAC3,SPI1        MADD         
# 3 rs61937595 NDUFA4L2,STAC3,CAMK2N1 STAC3 

CodePudding user response:

I fixed df2. I am sure my method is not the most "programatic" one and that there is a nicer way of doing this with pivot, join and mutate functions but this code works:

What the loop does, is for each row of Highest score (which was created just at the line above the loop) it filters for the symbols in that line according to df1$Symbol. The unlist(str_split()) is to split the symbol codes into seperate string vectors (by detecting the "," pattern). then retrieving the first row only (arranged by default from the highest, otherwise can add arrange(-score) argument to explicitly arrange it)

df1 <- data.frame(SNP = c("rs11807834", "rs3729986", "rs61937595" ),
                  "Symbols" = c("GRIN1,SETD1A", "MADD,STAC3,SPI1",
                              "NDUFA4L2,STAC3,CAMK2N1"))

# This code was corrected to match the example
df2 <- data.frame("Symbol" = c("GRIN1", "SETD1A", "MADD", "STAC3",
                               "CAMK2N1", "NDUFA4L2", "SPI1"),
                  Score=c(167,160,164,12,3,0,0))

df3 <- df1
df3$Highest.Score <- NA
for(i in 1:nrow(df1)){
  
 df3$Highest.Score[i] <-  df2 %>% filter(Symbol %in% 
                   unlist(str_split(df1$Symbols[i],","))) %>% 
    slice_head(n=1) %>% pull(Symbol)
}

print(df3)
Output:
> print(df3)
         SNP                Symbols Highest.Score
1 rs11807834           GRIN1,SETD1A         GRIN1
2  rs3729986        MADD,STAC3,SPI1          MADD
3 rs61937595 NDUFA4L2,STAC3,CAMK2N1         STAC3

CodePudding user response:

Another option with str_split and left_join inside a user defined function:

maxScore <- function(x){
    data.frame(Symbol = str_split(x, ",") %>% 
    unlist()
    ) %>% 
    left_join(df2, by = c("Symbol")) %>% 
    select(Score) %>% 
    max()
}

df1 %>% rowwise() %>% mutate(MS = maxScore(Symbols))
  • Related