Create a new dataset based on the observations of the one and the column of the other-CodePudding

I am trying to solve an issue that I face... lets say I have these two datasets, one drugs with their gene targets and the other patient data with the cancer genes:

docetaxel   carboplatin
AR      ACVR1C
CYP3A4  ALDH1A1
ESR1    ALDH3A1
NR3C1   ARFGEF1
TACR2   BAX
TDP1    NOTCH/MAST
TP53    BCL2L12
TUBA1A  BIRC2
TUBA1B  BRCA1

and

Patient ID  Breastcancergenes
1   ESR1-CCDC170
2   ESR1
3   BCL2L14-ETV6
4   ETV6-NTRK3
5   MYB-NFIB
6   NOTCH/MAST
7   ESR1
8   ESR1-CCDC170
9   BCL2L14-ETV6
10  MYB-NFIB

I would like to run a function in r that will create a new dataset (3rd one) that will learn from the observations of the patient data set and match with the drug of the other data set. for example like this below:

patient id breastcancergenes drug
2            ESR1            docetaxel
4            BAX             NA
6            NOTCH/MAST      carboplatin

I tried the left_join function but this didnt work for me so much. Has anyone any idea that could help me. Thank you very much!

CodePudding user response：

Assuming the first dataframe is called df1, then the second df2 to get the third one you need to join them using left_join but before that, you need first to use pivot_longer to make df1 have drug-gene pairs at each row, instead of the current structure:

library(tidyverse)
df1 <- df1 %>% 
   pivot_longer(cols=everything(),names_to = 'drug',values_to='gene')

Now, the two data frames can be joined:

df3 <- df2 %>% left_join(df1, by=c('Breastcancergenes'='gene'))

CodePudding user response：

Assuming they need to be exact matches, then maybe this approach. pivot_longer on the first dataset creates a gene-drug reference. Then one can left_join patient and drug data by gene.

The question says "learn", but I assume it's a matching rather than a machine learning question?

library(tidyverse)

drug_df <- tribble(
  ~docetaxel, ~carboplatin,
  "AR", "ACVR1C",
  "CYP3A4", "ALDH1A1",
  "ESR1", "ALDH3A1",
  "NR3C1", "ARFGEF1",
  "TACR2", "BAX",
  "TDP1", "NOTCH/MAST",
  "TP53", "BCL2L12",
  "TUBA1A", "BIRC2",
  "TUBA1B", "BRCA1"
) |> 
  pivot_longer(everything(), names_to = "drug", values_to = "gene")

patient_df <- tribble(
  ~Patient_ID, ~Breastcancergenes,
  1, "ESR1-CCDC170",
  2, "ESR1",
  3, "BCL2L14-ETV6",
  4, "ETV6-NTRK3",
  5, "MYB-NFIB",
  6, "NOTCH/MAST",
  7, "ESR1",
  8, "ESR1-CCDC170",
  9, "BCL2L14-ETV6",
  10, "MYB-NFIB"
)

patient_df |> 
  left_join(drug_df, by = c("Breastcancergenes" = "gene"))
#> # A tibble: 10 × 3
#>    Patient_ID Breastcancergenes drug       
#>         <dbl> <chr>             <chr>      
#>  1          1 ESR1-CCDC170      <NA>       
#>  2          2 ESR1              docetaxel  
#>  3          3 BCL2L14-ETV6      <NA>       
#>  4          4 ETV6-NTRK3        <NA>       
#>  5          5 MYB-NFIB          <NA>       
#>  6          6 NOTCH/MAST        carboplatin
#>  7          7 ESR1              docetaxel  
#>  8          8 ESR1-CCDC170      <NA>       
#>  9          9 BCL2L14-ETV6      <NA>       
#> 10         10 MYB-NFIB          <NA>

^{Created on 2022-06-18 by the reprex package (v2.0.1)}