I am learning how to use the R dplyr 'join' functions by doing the exercises from this course: https://github.com/uclouvain-cbio/WSBIM1207 and got stuck on the problem described below.
First, download the example dataframes used for this question:
BiocManager::install("UCLouvain-CBIO/rWSBIM1207")
Load the package:
library(rWSBIM1207)
Then in R/RStudio load the dataframe files, 'clinical2' and 'expression' by typing:
data(clinical2)
data(expression)
The task is, firstly: 'Join the expression and clinical2 tables by the patient reference, using the left_join and the right_join functions.' I did that in this way:
left_join(expression, clinical2,
by = c("patient" = "patientID"))
right_join(expression, clinical2,
by = c("patient" = "patientID"))
The second task is to explain why the results are different. I found that there are 3 more rows in the right_join output versus the left_join output. This seems odd to me given that 'clinical2' has 516 rows, whereas 'expression' has 570 rows. The 3 extra rows present in the r_join output have in common that they contain multiple NA values, which presumably represent patients found in 'clinical2' and not in 'expression'. I don't really understand what is going on here, and would be grateful for any help.
CodePudding user response:
Update Maybe not clearly explained. Here an explanation with the original data:
create left join
left <- left_join(expression, clinical2, by = c("patient" = "patientID"))
dim(expression)
[1] 570 8
dim(left)
[1] 570 10
create right join
right <- right_join(expression, clinical2,
by = c("patient" = "patientID"))
dim(expression)
[1] 570 8
dim(right)
[1] 573 10
You want to know why dim(right) is 573!
Explanation step by step:
- Definition of
right_join()
: includes all rows iny
(y
is hereclinical2
). - By doing so: there are 3 rows in
clinical2
(y) which are not inexpression
(x)
See here:
which patient occur in clinical2
but not in expression
anti_join(clinical2, expression, by=c("patientID"="patient"))
patientID gender years_at_diagnosis
<chr> <chr> <dbl>
1 TCGA-55-7284 male 74.2
2 TCGA-55-7913 female 61.2
3 TCGA-67-4679 male 69.0
Again:
right_join(expression, clinical2, by = c("patient" = "patientID"))
We start with expression
(x) (dim = 570 8) and join with clinical2
(y) (dim = 516 3)
So now what happens:
- All 570 from
expression
(x) are matched with all inclinial2
(y) (dim = 516 3) EXCEPT these 3 patientID inclinical2
TCGA-55-7284, TCGA-55-7913, TCGA-67-4679
right_join
now takes all 570 fromexpression
and adds the 3 not matchingpatientID
fromclinical2
resulting in a dim of 573 10- In contrast
left_join
:left_join()
: includes all rows in x (=expression
), So if we do
anti_join(expression, clinical2, by=c("patient"="patientID"))
We get:
# ... with 8 variables: sampleID <fct>, patient <chr>, type <chr>, A1BG <dbl>, A1CF <dbl>,
# A2BP1 <dbl>, A2LD1 <dbl>, A2ML1 <dbl>````
That means all rows are included in expression. So no additional rows will be added here:
First answer: In joining two things are important:
- from which side you start to join, e.g. which table is in first position
- Given position of tables e.g. df1, df2 which join method you apply
See this example:
library(dplyr)
library(tibble)
# add ID
iris1<- iris %>%
tibble::rowid_to_column("ID")
# add ID
mtcars1 <- mtcars %>%
tibble::rowid_to_column("ID")
dim(iris1)
# [1] 150 6
dim(mtcars1)
# [1] 32 12
# 1. iris1 is first and we start from left e.g. iris1
a <- left_join(iris1, mtcars1, by="ID")
dim(a)
# [1] 150 17
# 2. iris1 is still first, but we join from right e.g. mtcars1
b <- right_join(iris1, mtcars1, by="ID")
dim(b)
# [1] 32 17
# 3. mtcars1 is first and we join from left e.g mtcars1
a1 <- left_join(mtcars1, iris1, by="ID")
dim(a1)
# [1] 32 17
-> b = a1 e.g. right_join(iris1, mtcars1, by="ID") = left_join(mtcars1, iris1, by="ID")
https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/join