Home > Software engineering >  Create a new dataset based on the observations of the one and the column of the other
Create a new dataset based on the observations of the one and the column of the other

Time:06-19

I am trying to solve an issue that I face... lets say I have these two datasets, one drugs with their gene targets and the other patient data with the cancer genes:

docetaxel   carboplatin
AR      ACVR1C
CYP3A4  ALDH1A1
ESR1    ALDH3A1
NR3C1   ARFGEF1
TACR2   BAX
TDP1    NOTCH/MAST
TP53    BCL2L12
TUBA1A  BIRC2
TUBA1B  BRCA1

and

Patient ID  Breastcancergenes
1   ESR1-CCDC170
2   ESR1
3   BCL2L14-ETV6
4   ETV6-NTRK3
5   MYB-NFIB
6   NOTCH/MAST
7   ESR1
8   ESR1-CCDC170
9   BCL2L14-ETV6
10  MYB-NFIB

I would like to run a function in r that will create a new dataset (3rd one) that will learn from the observations of the patient data set and match with the drug of the other data set. for example like this below:

patient id breastcancergenes drug
2            ESR1            docetaxel
4            BAX             NA
6            NOTCH/MAST      carboplatin

I tried the left_join function but this didnt work for me so much. Has anyone any idea that could help me. Thank you very much!

CodePudding user response:

Assuming the first dataframe is called df1, then the second df2 to get the third one you need to join them using left_join but before that, you need first to use pivot_longer to make df1 have drug-gene pairs at each row, instead of the current structure:

library(tidyverse)
df1 <- df1 %>% 
   pivot_longer(cols=everything(),names_to = 'drug',values_to='gene')

Now, the two data frames can be joined:

df3 <- df2 %>% left_join(df1, by=c('Breastcancergenes'='gene'))

CodePudding user response:

Assuming they need to be exact matches, then maybe this approach. pivot_longer on the first dataset creates a gene-drug reference. Then one can left_join patient and drug data by gene.

The question says "learn", but I assume it's a matching rather than a machine learning question?

library(tidyverse)

drug_df <- tribble(
  ~docetaxel, ~carboplatin,
  "AR", "ACVR1C",
  "CYP3A4", "ALDH1A1",
  "ESR1", "ALDH3A1",
  "NR3C1", "ARFGEF1",
  "TACR2", "BAX",
  "TDP1", "NOTCH/MAST",
  "TP53", "BCL2L12",
  "TUBA1A", "BIRC2",
  "TUBA1B", "BRCA1"
) |> 
  pivot_longer(everything(), names_to = "drug", values_to = "gene")

patient_df <- tribble(
  ~Patient_ID, ~Breastcancergenes,
  1, "ESR1-CCDC170",
  2, "ESR1",
  3, "BCL2L14-ETV6",
  4, "ETV6-NTRK3",
  5, "MYB-NFIB",
  6, "NOTCH/MAST",
  7, "ESR1",
  8, "ESR1-CCDC170",
  9, "BCL2L14-ETV6",
  10, "MYB-NFIB"
)

patient_df |> 
  left_join(drug_df, by = c("Breastcancergenes" = "gene"))
#> # A tibble: 10 × 3
#>    Patient_ID Breastcancergenes drug       
#>         <dbl> <chr>             <chr>      
#>  1          1 ESR1-CCDC170      <NA>       
#>  2          2 ESR1              docetaxel  
#>  3          3 BCL2L14-ETV6      <NA>       
#>  4          4 ETV6-NTRK3        <NA>       
#>  5          5 MYB-NFIB          <NA>       
#>  6          6 NOTCH/MAST        carboplatin
#>  7          7 ESR1              docetaxel  
#>  8          8 ESR1-CCDC170      <NA>       
#>  9          9 BCL2L14-ETV6      <NA>       
#> 10         10 MYB-NFIB          <NA>

Created on 2022-06-18 by the reprex package (v2.0.1)

  •  Tags:  
  • r
  • Related