Joining three large tables by multiple columns in R

Time:01-30

I have three tables that broadly follow the same format:

CHR_POS CHR POS SNPID Allele1 Allele2 BETA SE p.value AF.Cases AF.Controls
10_10064415 chr10 10064415 . G A -0.325409092359287 0.206289483498678 0.114694529999742 0.0494505494505494 0.07143797301744
10_10064534 chr10 10064534 . C T 0.15791470499413 0.110984360432685 0.154778055526005 0.384615384615385 0.349391247120763
10_10064558 chr10 10064558 . T TA -0.14178643419147 0.470858147924979 0.763320353066551 0.010989010989011 0.0128989799276078
10_10064720 chr10 10064720 . C A -0.186935246879783 0.107658933163489 0.0824992455316354 0.35989010989011 0.405572008336075
10_10065002 chr10 10065002 . T TA -0.160166303710601 0.10545414459499 0.128806760138629 0.475274725274725 0.514807502467917
10_10065141 chr10 10065141 . T C -0.14178643419147 0.470858147924979 0.763320353066551 0.010989010989011 0.0128989799276078
10_10065213 chr10 10065213 . A G -0.325580365705231 0.206187190855573 0.114324052785598 0.0494505494505494 0.0715037841395196
10_10065256 chr10 10065256 . A G -0.325580365705231 0.206187190855573 0.114324052785598 0.0494505494505494 0.0715037841395196
10_10065304 chr10 10065304 . A G -0.160103269643433 0.10545894309007 0.128974736252953 0.475274725274725 0.514752659866184

I would like to reduce them so that each of the three files contains only the variants present in all three, that is, those that share the same CHR_POS, Allele1 and Allele2.

I am not looking to merge these. I would still like three tables as output, each subset so that the three share the same combinations of these three columns.

Many thanks

CodePudding user response:

Here's an example of three data frames:

df1 <- data.frame(
    CHR_POS = letters[1:3],
    Allele1 = letters[1:3],
    Allele2 = letters[1:3]
)

df2 <- data.frame(
    CHR_POS = letters[2:4],
    Allele1 = letters[2:4],
    Allele2 = letters[2:4]
)


df3 <- data.frame(
    CHR_POS = letters[3:5],
    Allele1 = letters[3:5],
    Allele2 = letters[3:5]
)

You can use semi_join from dplyr. It returns only the rows of the first data frame that have a match in the second, and it keeps only the first data frame's columns, so nothing is merged. If you apply it against both of the other data frames, you are left with only the rows found in all three.

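library(dplyr)

# Rows of df1 whose key columns also appear in df2 and df3
df1 %>% 
    semi_join(df2, by = c("CHR_POS", "Allele1", "Allele2")) %>% 
    semi_join(df3, by = c("CHR_POS", "Allele1", "Allele2"))

# Rows of df2 whose key columns also appear in df1 and df3
df2 %>% 
    semi_join(df1, by = c("CHR_POS", "Allele1", "Allele2")) %>% 
    semi_join(df3, by = c("CHR_POS", "Allele1", "Allele2"))

# Rows of df3 whose key columns also appear in df1 and df2
df3 %>% 
    semi_join(df1, by = c("CHR_POS", "Allele1", "Allele2")) %>% 
    semi_join(df2, by = c("CHR_POS", "Allele1", "Allele2"))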
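
If you would rather not repeat the joins for every table, one possible variation is to work out the shared key combinations once and then filter each table against them. This is only a sketch using the toy data frames from above. The names keys, tables, key_sets, shared and filtered are made up for illustration, and I have not timed it on data of your size.

```r
library(dplyr)

# Toy data frames from the example above
df1 <- data.frame(CHR_POS = letters[1:3], Allele1 = letters[1:3], Allele2 = letters[1:3])
df2 <- data.frame(CHR_POS = letters[2:4], Allele1 = letters[2:4], Allele2 = letters[2:4])
df3 <- data.frame(CHR_POS = letters[3:5], Allele1 = letters[3:5], Allele2 = letters[3:5])

# Columns that identify a variant (illustrative name)
keys <- c("CHR_POS", "Allele1", "Allele2")

tables <- list(df1 = df1, df2 = df2, df3 = df3)

# Unique key combinations from each table, dropping all other columns
key_sets <- lapply(tables, function(d) distinct(d[, keys, drop = FALSE]))

# Key combinations that occur in every table
shared <- Reduce(function(x, y) inner_join(x, y, by = keys), key_sets)

# Keep only the rows of each table whose keys are in the shared set;
# semi_join leaves each table's own columns untouched
filtered <- lapply(tables, function(d) semi_join(d, shared, by = keys))
```

filtered is then a list of three data frames (filtered$df1, filtered$df2, filtered$df3), each restricted to the variants common to all three. Joining on the de-duplicated key columns only keeps the intermediate shared table small, which may help when the full tables are large.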