I am trying to solve an issue that I face... lets say I have these two datasets, one drugs with their gene targets and the other patient data with the cancer genes:
docetaxel carboplatin
AR ACVR1C
CYP3A4 ALDH1A1
ESR1 ALDH3A1
NR3C1 ARFGEF1
TACR2 BAX
TDP1 NOTCH/MAST
TP53 BCL2L12
TUBA1A BIRC2
TUBA1B BRCA1
and
Patient ID Breastcancergenes
1 ESR1-CCDC170
2 ESR1
3 BCL2L14-ETV6
4 ETV6-NTRK3
5 MYB-NFIB
6 NOTCH/MAST
7 ESR1
8 ESR1-CCDC170
9 BCL2L14-ETV6
10 MYB-NFIB
I would like to run a function in r that will create a new dataset (3rd one) that will learn from the observations of the patient data set and match with the drug of the other data set. for example like this below:
patient id breastcancergenes drug
2 ESR1 docetaxel
4 BAX NA
6 NOTCH/MAST carboplatin
I tried the left_join function but this didnt work for me so much. Has anyone any idea that could help me. Thank you very much!
CodePudding user response:
Assuming the first dataframe is called df1
, then the second df2
to get the third one you need to join them using left_join
but before that, you need first to use pivot_longer
to make df1 have drug-gene pairs at each row, instead of the current structure:
library(tidyverse)
df1 <- df1 %>%
pivot_longer(cols=everything(),names_to = 'drug',values_to='gene')
Now, the two data frames can be joined:
df3 <- df2 %>% left_join(df1, by=c('Breastcancergenes'='gene'))
CodePudding user response:
Assuming they need to be exact matches, then maybe this approach. pivot_longer
on the first dataset creates a gene-drug reference. Then one can left_join
patient and drug data by gene.
The question says "learn", but I assume it's a matching rather than a machine learning question?
library(tidyverse)
drug_df <- tribble(
~docetaxel, ~carboplatin,
"AR", "ACVR1C",
"CYP3A4", "ALDH1A1",
"ESR1", "ALDH3A1",
"NR3C1", "ARFGEF1",
"TACR2", "BAX",
"TDP1", "NOTCH/MAST",
"TP53", "BCL2L12",
"TUBA1A", "BIRC2",
"TUBA1B", "BRCA1"
) |>
pivot_longer(everything(), names_to = "drug", values_to = "gene")
patient_df <- tribble(
~Patient_ID, ~Breastcancergenes,
1, "ESR1-CCDC170",
2, "ESR1",
3, "BCL2L14-ETV6",
4, "ETV6-NTRK3",
5, "MYB-NFIB",
6, "NOTCH/MAST",
7, "ESR1",
8, "ESR1-CCDC170",
9, "BCL2L14-ETV6",
10, "MYB-NFIB"
)
patient_df |>
left_join(drug_df, by = c("Breastcancergenes" = "gene"))
#> # A tibble: 10 × 3
#> Patient_ID Breastcancergenes drug
#> <dbl> <chr> <chr>
#> 1 1 ESR1-CCDC170 <NA>
#> 2 2 ESR1 docetaxel
#> 3 3 BCL2L14-ETV6 <NA>
#> 4 4 ETV6-NTRK3 <NA>
#> 5 5 MYB-NFIB <NA>
#> 6 6 NOTCH/MAST carboplatin
#> 7 7 ESR1 docetaxel
#> 8 8 ESR1-CCDC170 <NA>
#> 9 9 BCL2L14-ETV6 <NA>
#> 10 10 MYB-NFIB <NA>
Created on 2022-06-18 by the reprex package (v2.0.1)