How to extract the values of interest in a string

In one column of my data frame, the same object appears under several different names.

Say, for example, that I'm working with types of cancer. Each type of cancer comes with several sub-specifications.

type <- c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)",
          "Breast (ER- / PR- / HER2- / AR - / EGFR -)",
          "Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)",
          "Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)",
          "Breast (ER- / PR- / HER2- / PDL1 - / AR -)",
          "Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))",
          "Breast (ER- / PR- / HER2-)",
          "Breast (ER- / PR- / HER2- / PD-L1 -)",
          "Breast (ER- / PR+ / HER2 -)")

dat <- as.data.frame(type)
dat

So we have:

Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)                
Breast (ER- / PR- / HER2- / AR - / EGFR -)              
Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)               
Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)                
Breast (ER- / PR- / HER2- / PDL1 - / AR -)              
Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))             
Breast (ER- / PR- / HER2-)              
Breast (ER- / PR- / HER2- / PD-L1 -)                
Breast (ER- / PR+ / HER2 -)

It may not look like it, but there are only two different types of cancer here: Breast (ER- / PR- / HER2-) and Breast (ER- / PR+ / HER2-).

Of course, I have many more rows; this is just a sample. I would like to write a function that counts how many of each type I have, paying attention only to the ER, PR and HER2 values.

For this, I thought of creating a function that captures the string composed of Breast (ER\s PR\s HER2\s, where \s is any possible value (I separate them because, as you can see, these three values do not always follow each other).

But I didn't find a way of doing this with gsub.

Edit:

At the end I would like to get another column that would look like:

Breast (ER- / PR- / HER2-)              
Breast (ER- / PR- / HER2-)              
Breast (ER- / PR- / HER2-)              
Breast (ER- / PR- / HER2-)              
Breast (ER- / PR- / HER2-)              
Breast (ER- / PR- / HER2-)              
Breast (ER- / PR- / HER2-)              
Breast (ER- / PR- / HER2-)              
Breast (ER- / PR+ / HER2-)

This would allow me to count the types with the unique() function.

CodePudding user response：

The `stringr` package is your friend

`stringr` is part of the tidyverse and provides a lot of helpful wrappers that make handling strings more intuitive.

There are a few steps needed here, so I'll show you the output after each step.

library(dplyr)   # mutate(), pull()
library(stringr) # str_remove_all(), str_extract_all(), str_c()

type <- c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)",
          "Breast (ER- / PR- / HER2- / AR - / EGFR -)",
          "Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)",
          "Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)",
          "Breast (ER- / PR- / HER2- / PDL1 - / AR -)",
          "Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))",
          "Breast (ER- / PR- / HER2-)",
          "Breast (ER- / PR- / HER2- / PD-L1 -)",
          "Breast (ER- / PR+ / HER2 -)")

dat <- as.data.frame(type)

## Get rid of all spaces
mutate(dat, simple_type = str_remove_all(type, "[:space:]")) |> 
  pull(simple_type) # just using pull to show you where we're up to with the process
#> [1] "Breast(ER-/PR-/EGFR-/AR-/PD-L1-/HER2-)"            
#> [2] "Breast(ER-/PR-/HER2-/AR-/EGFR-)"                   
#> [3] "Breast(ER-/PR-/HER2-/BRCA-/PDL11%/FGFR-)"          
#> [4] "Breast(ER-/PR-/HER2-/BRCA-/PDL12%)"                
#> [5] "Breast(ER-/PR-/HER2-/PDL1-/AR-)"                   
#> [6] "Breast(ER-/PR-/HER2-/PD-L150%(BreastandIC5%liver))"
#> [7] "Breast(ER-/PR-/HER2-)"                             
#> [8] "Breast(ER-/PR-/HER2-/PD-L1-)"                      
#> [9] "Breast(ER-/PR+/HER2-)"

## Extract a list of the codes we're interested in
mutate(dat,
       simple_type = str_remove_all(type, "[:space:]") |>
         str_extract_all("(ER[+-])|(PR[+-])|(HER2[+-])")) |>  ## extract all instances of 'ER' followed by + or -, OR 'PR' followed by + or -, etc.
  pull(simple_type)
#> [[1]]
#> [1] "ER-"   "PR-"   "HER2-"
#> 
#> [[2]]
#> [1] "ER-"   "PR-"   "HER2-"
#> 
#> [[3]]
#> [1] "ER-"   "PR-"   "HER2-"
#> 
#> [[4]]
#> [1] "ER-"   "PR-"   "HER2-"
#> 
#> [[5]]
#> [1] "ER-"   "PR-"   "HER2-"
#> 
#> [[6]]
#> [1] "ER-"   "PR-"   "HER2-"
#> 
#> [[7]]
#> [1] "ER-"   "PR-"   "HER2-"
#> 
#> [[8]]
#> [1] "ER-"   "PR-"   "HER2-"
#> 
#> [[9]]
#> [1] "ER-"   "PR+"   "HER2-"

## Collapse each list element into a single string, then turn the list into a character vector
### (saving this new df as 'dat' because it makes the next step much easier to write)
dat <- 
  mutate(dat,
       simple_type = str_remove_all(type, "[:space:]") |>
         str_extract_all("(ER[+-])|(PR[+-])|(HER2[+-])") |>
         lapply(str_c, collapse = " / ") |> # stringr::str_c() is pretty much identical to base::paste()
         as.character())
dat[["simple_type"]]
#> [1] "ER- / PR- / HER2-" "ER- / PR- / HER2-" "ER- / PR- / HER2-"
#> [4] "ER- / PR- / HER2-" "ER- / PR- / HER2-" "ER- / PR- / HER2-"
#> [7] "ER- / PR- / HER2-" "ER- / PR- / HER2-" "ER- / PR+ / HER2-"

## Paste back in the other stuff
dat <- mutate(dat, simple_type = str_c("Breast (", simple_type, ")"))
dat[["simple_type"]]
#> [1] "Breast (ER- / PR- / HER2-)" "Breast (ER- / PR- / HER2-)"
#> [3] "Breast (ER- / PR- / HER2-)" "Breast (ER- / PR- / HER2-)"
#> [5] "Breast (ER- / PR- / HER2-)" "Breast (ER- / PR- / HER2-)"
#> [7] "Breast (ER- / PR- / HER2-)" "Breast (ER- / PR- / HER2-)"
#> [9] "Breast (ER- / PR+ / HER2-)"

Created on 2022-05-27 by the reprex package (v2.0.1)
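
Since the goal in the question is to count the types, the steps above could be wrapped into one function and the result tabulated. The following is only a minimal sketch: `simplify_type()` is a hypothetical helper name, the three sample strings are taken from the question, and it assumes a positive status is written with a plus sign (as in `PR+`). The markers come out in the order they appear in the string, so a row that lists HER2 before PR would need extra handling.

```r
library(stringr)

# Hypothetical helper (not part of the answer above): reduce each description
# to its ER / PR / HER2 status so that equivalent rows become identical.
simplify_type <- function(x) {
  compact <- str_remove_all(x, "\\s")                      # drop all whitespace
  markers <- str_extract_all(compact, "(ER|PR|HER2)[+-]")  # each marker with its sign
  status <- vapply(markers, paste, character(1), collapse = " / ")
  str_c("Breast (", status, ")")
}

# Three rows from the question, assuming "+" marks a positive status
type <- c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)",
          "Breast (ER- / PR- / HER2- / PD-L1 -)",
          "Breast (ER- / PR+ / HER2 -)")

# One count per simplified type
table(simplify_type(type))
```

`table()` is used here instead of `unique()` because it returns the counts directly; `dplyr::count()` on the new column should work equally well on a data frame.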

CodePudding user response：

If I am not misunderstanding something, the cancer types can be distinguished unambiguously by the sign (+/-) after "PR". Would this be an option for you?

Note: the pattern handling probably needs to be more robust (to account for extra whitespace and typos), because this relies on fixed character positions. I am not good with regular expressions.

library(stringr)
df$type <- as.factor(str_sub(df$cancer, 14, 17))
table(df$type)
#> 
#>  PR-  PR+ 
#>    8    1

Created on 2022-05-27 by the reprex package (v2.0.1)

Data

df <- data.frame(cancer = c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)",
                            "Breast (ER- / PR- / HER2- / AR - / EGFR -)",
                            "Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)",
                            "Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)",
                            "Breast (ER- / PR- / HER2- / PDL1 - / AR -)",
                            "Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))",
                            "Breast (ER- / PR- / HER2-)",
                            "Breast (ER- / PR- / HER2- / PD-L1 -)",
                            "Breast (ER- / PR+ / HER2 -)")
)
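
The fixed positions passed to `str_sub()` only work while "PR" sits at exactly the same place in every string. A pattern-based variant might look like the sketch below. It assumes the sign is always `+` or `-`, possibly separated from "PR" by whitespace; the second and third sample rows are made up to illustrate extra spacing and a different marker order.

```r
library(stringr)

cancer <- c("Breast (ER- / PR- / HER2-)",
            "Breast (ER- / PR + / HER2 -)",  # hypothetical row with extra spacing
            "Breast (ER- / HER2- / PR-)")    # hypothetical row with PR listed last

# Capture the sign that follows "PR", wherever "PR" occurs in the string.
# str_match() returns a matrix; column 2 holds the first capture group.
pr_sign <- str_match(cancer, "\\bPR\\s*([+-])")[, 2]

pr_status <- factor(str_c("PR", pr_sign))
table(pr_status)
```

Rows without a recognisable "PR" entry come back as `NA`, which makes malformed input easy to spot.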

CodePudding user response：

Here is a variant of the stringr approach. First, extract the ER, PR, and HER2 components, including the sign (+/-). The line starting with across() removes the space between each marker name and its sign. Then put the three components together in one column.

dat <- read.delim(text = text)

library(tidyverse)

dat |> 
  mutate(er = str_extract(type, "ER\\s*[+-]"),
         pr = str_extract(type, "PR\\s*[+-]"),
         her2 = str_extract(type, "HER2\\s*[+-]"),
         across(c(er, pr, her2), ~ str_remove(., "\\s")),
         type1 = paste0("Breast (", er, " / ", pr, " / ", her2, ")")) |> 
  select(-c(er, pr, her2))

#                                                             type                      type1
#           Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-) Breast (ER- / PR- / HER2-)
#                     Breast (ER- / PR- / HER2- / AR - / EGFR -) Breast (ER- / PR- / HER2-)
#          Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -) Breast (ER- / PR- / HER2-)
#                   Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%) Breast (ER- / PR- / HER2-)
#                     Breast (ER- / PR- / HER2- / PDL1 - / AR -) Breast (ER- / PR- / HER2-)
#Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver)) Breast (ER- / PR- / HER2-)
#                                     Breast (ER- / PR- / HER2-) Breast (ER- / PR- / HER2-)
#                           Breast (ER- / PR- / HER2- / PD-L1 -) Breast (ER- / PR- / HER2-)
#                                    Breast (ER- / PR+ / HER2 -) Breast (ER- / PR+ / HER2-)

Data:

text <- "type
Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)
Breast (ER- / PR- / HER2- / AR - / EGFR -)
Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)
Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)
Breast (ER- / PR- / HER2- / PDL1 - / AR -)
Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2- / PD-L1 -)
Breast (ER- / PR+ / HER2 -)"
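
Because the question asks about `gsub`, the same per-marker idea can also be sketched in base R with `sub()` and a capture group. This is only an illustration under stated assumptions: `get_marker()` is a hypothetical helper name, the sample rows are taken from the question, and the sign is assumed to be `+` or `-`.

```r
# Hypothetical helper: return the marker with its sign (e.g. "PR+") for each
# string, or NA where the marker is not found.
get_marker <- function(x, marker) {
  pattern <- paste0("^.*\\b", marker, "\\s*([+-]).*$")
  found <- grepl(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  # Replace the whole string by the captured sign, then prepend the marker name
  out[found] <- paste0(marker, sub(pattern, "\\1", x[found], perl = TRUE))
  out
}

type <- c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)",
          "Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))",
          "Breast (ER- / PR+ / HER2 -)")

simple <- paste0("Breast (",
                 get_marker(type, "ER"), " / ",
                 get_marker(type, "PR"), " / ",
                 get_marker(type, "HER2"), ")")

table(simple)
```

Extracting each marker separately keeps the output order fixed at ER, PR, HER2, regardless of the order in which the markers appear in the input.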
