Extract specific values from two columns of a data frame

I have a data frame in which only two columns interest me. Those two columns contain labels that I need to extract. There are 4 labels: CR, PD, PR, SD.

In the sample below you can see those two columns and the 4 labels, mixed with other unwanted strings such as io.response or pfs. Have a look:

structure(list(`!Sample_characteristics_ch1.22` = c("duration.of.io.tx: 174", 
"io.response: PD", "io.response: PD", "duration.of.io.tx: 21", 
"io.response: PD", "duration.of.io.tx: 21", "io.response: PD", 
"io.response: PD", "io.response: PR", "duration.of.io.tx: 157", 
"io.response: PD"), `!Sample_characteristics_ch1.23` = c("io.response: PD", 
"pfs: 106", "pfs: 57", "io.response: PD", "pfs: 30", "io.response: PD", 
"pfs: 25", "pfs: 17", "pfs: 338", "io.response: SD", "pfs: 41"
)), row.names = c("Patient sample BACI139", "Patient sample BACI140", 
"Patient sample BACI142", "Patient sample BACI143", "Patient sample BACI144", 
"Patient sample BACI148", "Patient sample BACI149", "Patient sample BACI150", 
"Patient sample BACI151", "Patient sample BACI152", "Patient sample BACI153"
), class = "data.frame")

What I need

Add a new column (call it whatever you want) that contains only those 4 labels. I don't want to delete or change the original columns because I want to keep the original data untouched.

Examples

In the first row, the second column is io.response: PD. Hence, the new column would simply be PD.

In the second row, the first column is io.response: PD, so the new column would also be PD for this row.

Thank you!

CodePudding user response：

This code should do what you need:

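library(dplyr)
library(stringr)
# str_c() and str_extract() are vectorised, so rowwise()/ungroup() are not needed;
# without them the result also stays a data.frame and keeps its row names
df |> 
  mutate(newcol = str_extract(
    # join both columns, then pull out the first label found
    str_c(`!Sample_characteristics_ch1.22`, `!Sample_characteristics_ch1.23`),
    "PD|CR|PR|SD"))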
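
If you would rather not load any packages, the same idea can be sketched in base R. This is only a minimal sketch: the three-row data frame is a shortened stand-in for the one in the question, and the helper name extract_label and the column name label are made up for illustration.

```r
# Shortened stand-in for the data frame in the question (three rows only)
df <- data.frame(
  `!Sample_characteristics_ch1.22` = c("duration.of.io.tx: 174", "io.response: PD", "io.response: PR"),
  `!Sample_characteristics_ch1.23` = c("io.response: PD", "pfs: 106", "pfs: 338"),
  check.names = FALSE  # keep the unusual column names as they are
)

# Hypothetical helper: return the label if x is an io.response entry, NA otherwise
extract_label <- function(x) {
  pattern <- "^io\\.response: (CR|PD|PR|SD)$"
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

a <- extract_label(df[["!Sample_characteristics_ch1.22"]])
b <- extract_label(df[["!Sample_characteristics_ch1.23"]])

# Use the first column's label where present, otherwise the second column's
df$label <- ifelse(is.na(a), b, a)
df$label
#> [1] "PD" "PD" "PR"
```

The original two columns are left unchanged; only the new label column is added.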

CodePudding user response：

If dplyr works for you, you could use coalesce() to take the first non-NA value in each row (if there is one). To extract the labels, use a rather strict regex that combines a lookbehind, (?<=...), with the set of allowed labels, (CR|PD|PR|SD):

library(dplyr)
library(stringr)
df %>% tibble::rownames_to_column() %>% as_tibble() %>% 
  mutate(io.response = coalesce(
    str_extract(`!Sample_characteristics_ch1.22`, "(?<=^io.response: )(CR|PD|PR|SD)$"),
    str_extract(`!Sample_characteristics_ch1.23`, "(?<=^io.response: )(CR|PD|PR|SD)$")))
#> # A tibble: 11 × 4
#>    rowname                `!Sample_characteristics_ch1.22` !Sample_cha…¹ io.re…²
#>    <chr>                  <chr>                            <chr>         <chr>  
#>  1 Patient sample BACI139 duration.of.io.tx: 174           io.response:… PD     
#>  2 Patient sample BACI140 io.response: PD                  pfs: 106      PD     
#>  3 Patient sample BACI142 io.response: PD                  pfs: 57       PD     
#>  4 Patient sample BACI143 duration.of.io.tx: 21            io.response:… PD     
#>  5 Patient sample BACI144 io.response: PD                  pfs: 30       PD     
#>  6 Patient sample BACI148 duration.of.io.tx: 21            io.response:… PD     
#>  7 Patient sample BACI149 io.response: PD                  pfs: 25       PD     
#>  8 Patient sample BACI150 io.response: PD                  pfs: 17       PD     
#>  9 Patient sample BACI151 io.response: PR                  pfs: 338      PR     
#> 10 Patient sample BACI152 duration.of.io.tx: 157           io.response:… SD     
#> 11 Patient sample BACI153 io.response: PD                  pfs: 41       PD     
#> # … with abbreviated variable names ¹`!Sample_characteristics_ch1.23`,
#> #   ²io.response

Created on 2023-02-01 with reprex v2.0.2
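
To see how the two pieces fit together, here is a small sketch on two toy vectors. The names col_a, col_b, from_a and from_b are made up for illustration; the strings are taken from the question's data and the pattern is the one used above.

```r
library(dplyr)
library(stringr)

# Two toy vectors standing in for the two columns
col_a <- c("duration.of.io.tx: 174", "io.response: PD")
col_b <- c("io.response: SD", "pfs: 106")

# Lookbehind requires the "io.response: " prefix but keeps it out of the match
pattern <- "(?<=^io.response: )(CR|PD|PR|SD)$"

# str_extract() returns NA wherever the pattern does not match
from_a <- str_extract(col_a, pattern)  # NA   "PD"
from_b <- str_extract(col_b, pattern)  # "SD" NA

# coalesce() picks, element by element, the first non-NA value
coalesce(from_a, from_b)
#> [1] "SD" "PD"
```

Because every non-matching string becomes NA, a row where neither column holds an io.response entry would end up as NA in the new column rather than picking up a stray match.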