I'd appreciate some help to solve this problem.
Data - I have two dataframes (df1 and df2 below). In df2, each row is an individual subject and they may or may not have one or more associated samples in the 4 sample columns (sample_a, sample_b etc). In df2, each sample number is unique so only appears once. Most subjects only have one associated sample (some have none) so they are in sample_a column but some have more than one so they are in sample_a, sample_b etc.
Goal - For each sample_id in df1, I would like to find the sample_id in df2, then copy across some info from that subject's row ("subject" and "house") to df1 whilst also keeping the column "value" in df1. See intended outcome below.
So far/question - I tried using dplyr::left_join as below which works well for those that only have a single associated sample but its not looking in the other sample columns i.e. sample_b, sample_c. So in the "intended outcome" example below, row 3 would have NA for subject and house. Is there a way to match sample_id across multiple columns? Or is there a better alternative approach?
dplyr::left_join(df1, df2, by = c("sample_id" = "sample_a")
Example data
df1 <- data.frame(
sample_id = c(1, 2, 3, 4, 5),
value = c(4351, 342, 235, 49, 10293))
df2 <- data.frame(
subject = c("001", "002", "003", "004", "005"),
house = c("a", "b", "c", "d", "e"),
sample_a = c(1, 2, NA, 4, 5),
sample_b = c(NA, 3, NA, NA, NA),
sample_c = c(NA, NA, NA, NA, NA),
sample_d = c(NA, NA, NA, NA, NA))
Intended outcome
df3 <- data.frame(
sample_id = c(1, 2, 3, 4, 5),
value = c(4351, 342, 235, 49, 10293),
subject = c("001", "002", "002", "004", "005"),
house = c("a", "b", "b", "d","e"))
Many thanks for your help
Thom
CodePudding user response:
Get the same key value in both data frames before merging them.
A tidyverse solution below:
library(tidyr)
library(dplyr)
df2_long <- df2 %>%
pivot_longer(cols = starts_with("sample"),
values_to = "sample_id") %>%
select(-name)
left_join(df1, df2_long, by = "sample_id")
Output:
sample_id value subject house
1 1 4351 001 a
2 2 342 002 b
3 3 235 002 b
4 4 49 004 d
5 5 10293 005 e
CodePudding user response:
Here's a one-liner using reshape2::melt
and merge
.
merge(df1, reshape2::melt(df2, 1:2)[-3], by.x='sample_id', by.y='value')
# sample_id value subject house
# 1 1 4351 001 a
# 2 2 342 002 b
# 3 3 235 002 b
# 4 4 49 004 d
# 5 5 10293 005 e