Home > Blockchain >  Join dataframes if key could be in multiple columns
Join dataframes if key could be in multiple columns

Time:09-30

I'd appreciate some help to solve this problem.

Data - I have two dataframes (df1 and df2 below). In df2, each row is an individual subject and they may or may not have one or more associated samples in the 4 sample columns (sample_a, sample_b etc). In df2, each sample number is unique so only appears once. Most subjects only have one associated sample (some have none) so they are in sample_a column but some have more than one so they are in sample_a, sample_b etc.

Goal - For each sample_id in df1, I would like to find the sample_id in df2, then copy across some info from that subject's row ("subject" and "house") to df1 whilst also keeping the column "value" in df1. See intended outcome below.

So far/question - I tried using dplyr::left_join as below which works well for those that only have a single associated sample but its not looking in the other sample columns i.e. sample_b, sample_c. So in the "intended outcome" example below, row 3 would have NA for subject and house. Is there a way to match sample_id across multiple columns? Or is there a better alternative approach?

dplyr::left_join(df1, df2, by = c("sample_id" = "sample_a")

Example data

df1 <- data.frame(
  sample_id = c(1, 2, 3, 4, 5),
  value = c(4351, 342, 235, 49, 10293))

df2 <- data.frame(
  subject = c("001", "002", "003", "004", "005"),
  house = c("a", "b", "c", "d", "e"),
  sample_a = c(1, 2, NA, 4, 5),
  sample_b = c(NA, 3, NA, NA, NA),
  sample_c = c(NA, NA, NA, NA, NA),
  sample_d = c(NA, NA, NA, NA, NA))

Intended outcome

df3 <- data.frame(
  sample_id = c(1, 2, 3, 4, 5),
  value = c(4351, 342, 235, 49, 10293),
  subject = c("001", "002", "002", "004", "005"),
  house = c("a", "b", "b", "d","e"))

Many thanks for your help

Thom

CodePudding user response:

Get the same key value in both data frames before merging them.

A tidyverse solution below:

library(tidyr)
library(dplyr)
df2_long <- df2 %>% 
  pivot_longer(cols = starts_with("sample"), 
               values_to = "sample_id") %>% 
  select(-name)

left_join(df1, df2_long, by = "sample_id") 

Output:

  sample_id value subject house
1         1  4351     001     a
2         2   342     002     b
3         3   235     002     b
4         4    49     004     d
5         5 10293     005     e

CodePudding user response:

Here's a one-liner using reshape2::melt and merge.

merge(df1, reshape2::melt(df2, 1:2)[-3], by.x='sample_id', by.y='value')
#   sample_id value subject house
# 1         1  4351     001     a
# 2         2   342     002     b
# 3         3   235     002     b
# 4         4    49     004     d
# 5         5 10293     005     e
  •  Tags:  
  • r
  • Related