In one column of my dataframe there are several namings for the same object.
Say, for example, that I'm working with types of cancer. There are several sub-specifications for each type of cancer.
type <- c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)", "Breast (ER- / PR- / HER2- / AR - / EGFR -)", "Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)", "Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)", "Breast (ER- / PR- / HER2- / PDL1 - / AR -)", "Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))", "Breast (ER- / PR- / HER2-)", "Breast (ER- / PR- / HER2- / PD-L1 -)", "Breast (ER- / PR / HER2 -)")
dat <- as.data.frame(type)
dat
So we have:
Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)
Breast (ER- / PR- / HER2- / AR - / EGFR -)
Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)
Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)
Breast (ER- / PR- / HER2- / PDL1 - / AR -)
Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2- / PD-L1 -)
Breast (ER- / PR / HER2 -)
It may not look like it, but we have only two different types of cancer here: Breast (ER- / PR- / HER2-) and Breast (ER- / PR / HER2-).
Of course I have many more rows; this is just an extract. I would like to develop a function that counts how many of each type I have, paying attention only to the ER, PR and HER2 values.
For this, I thought of creating a function that captures the string composed of Breast (ER\s PR\s HER2\s, where \s is any possible value (I separate them because, as you can see, these three values do not always follow each other).
But I didn't find a way of doing this with gsub.
Edit:
In the end I would like to get another column that looks like:
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2-)
Breast (ER- / PR / HER2-)
This would allow me to count with the unique() function.
CodePudding user response:
The stringr package is your friend
Stringr is a package that's part of the tidyverse and provides a lot of helpful wrappers to make handling strings more intuitive.
There are a few steps needed here, so I'll show you the output after each step
type <- c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)", "Breast (ER- / PR- / HER2- / AR - / EGFR -)", "Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)", "Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)", "Breast (ER- / PR- / HER2- / PDL1 - / AR -)", "Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))", "Breast (ER- / PR- / HER2-)", "Breast (ER- / PR- / HER2- / PD-L1 -)", "Breast (ER- / PR / HER2 -)")
dat <- as.data.frame(type)
library(dplyr)   # mutate(), pull()
library(stringr) # str_remove_all(), str_extract_all(), str_c()
## Get rid of all spaces
mutate(dat, simple_type = str_remove_all(type, "[:space:]")) |>
  pull(simple_type) # just using pull to show you where we're up to with the process
#> [1] "Breast(ER-/PR-/EGFR-/AR-/PD-L1-/HER2-)"
#> [2] "Breast(ER-/PR-/HER2-/AR-/EGFR-)"
#> [3] "Breast(ER-/PR-/HER2-/BRCA-/PDL11%/FGFR-)"
#> [4] "Breast(ER-/PR-/HER2-/BRCA-/PDL12%)"
#> [5] "Breast(ER-/PR-/HER2-/PDL1-/AR-)"
#> [6] "Breast(ER-/PR-/HER2-/PD-L150%(BreastandIC5%liver))"
#> [7] "Breast(ER-/PR-/HER2-)"
#> [8] "Breast(ER-/PR-/HER2-/PD-L1-)"
#> [9] "Breast(ER-/PR/HER2-)"
## Extract a list of the codes we're interested in
mutate(dat,
       simple_type = str_remove_all(type, "[:space:]") |>
         str_extract_all("(ER-?)|(PR-?)|(HER2-?)")) |> ## extract every 'ER', 'PR' or 'HER2', each with its '-' if it has one
  pull(simple_type)
#> [[1]]
#> [1] "ER-" "PR-" "HER2-"
#>
#> [[2]]
#> [1] "ER-" "PR-" "HER2-"
#>
#> [[3]]
#> [1] "ER-" "PR-" "HER2-"
#>
#> [[4]]
#> [1] "ER-" "PR-" "HER2-"
#>
#> [[5]]
#> [1] "ER-" "PR-" "HER2-"
#>
#> [[6]]
#> [1] "ER-" "PR-" "HER2-"
#>
#> [[7]]
#> [1] "ER-" "PR-" "HER2-"
#>
#> [[8]]
#> [1] "ER-" "PR-" "HER2-"
#>
#> [[9]]
#> [1] "ER-" "PR" "HER2-"
## Collapse each list element into a single string, then turn the list into a character vector
### (saving this new df as 'dat' because it makes the next step much easier to write)
dat <-
  mutate(dat,
         simple_type = str_remove_all(type, "[:space:]") |>
           str_extract_all("(ER-?)|(PR-?)|(HER2-?)") |>
           lapply(str_c, collapse = " / ") |> # stringr::str_c() with `collapse` behaves much like base::paste()
           as.character())
dat[["simple_type"]]
#> [1] "ER- / PR- / HER2-" "ER- / PR- / HER2-" "ER- / PR- / HER2-"
#> [4] "ER- / PR- / HER2-" "ER- / PR- / HER2-" "ER- / PR- / HER2-"
#> [7] "ER- / PR- / HER2-" "ER- / PR- / HER2-" "ER- / PR / HER2-"
## Paste back in the other stuff
dat <- mutate(dat, simple_type = str_c("Breast (", simple_type, ")"))
dat[["simple_type"]]
#> [1] "Breast (ER- / PR- / HER2-)" "Breast (ER- / PR- / HER2-)"
#> [3] "Breast (ER- / PR- / HER2-)" "Breast (ER- / PR- / HER2-)"
#> [5] "Breast (ER- / PR- / HER2-)" "Breast (ER- / PR- / HER2-)"
#> [7] "Breast (ER- / PR- / HER2-)" "Breast (ER- / PR- / HER2-)"
#> [9] "Breast (ER- / PR / HER2-)"
Created on 2022-05-27 by the reprex package (v2.0.1)
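To get from the new simple_type column to the counts the question asks for, table() should be enough. This is only a minimal sketch: it assumes the column looks like the output above, and the values are retyped here so that the snippet runs on its own. The names counts and count_df are made up for the illustration.

```r
# Simplified labels, retyped from the output above so this block is self-contained
simple_type <- c(rep("Breast (ER- / PR- / HER2-)", 8),
                 "Breast (ER- / PR / HER2-)")

# Count how often each distinct label occurs
counts <- table(simple_type)

# Look up single counts by label
counts[["Breast (ER- / PR- / HER2-)"]]
counts[["Breast (ER- / PR / HER2-)"]]

# The same information as a data frame with a count column `n`
count_df <- as.data.frame(counts, responseName = "n")
count_df
```

With dplyr loaded, count(dat, simple_type) should give the same counts straight from the data frame.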
CodePudding user response:
If I understand correctly, the cancer types can be distinguished unambiguously by what directly follows "PR" (a sign or nothing). Then, would this be an option for you?
Note: the pattern handling probably needs to be more robust (to account for whitespace and typos). I am not good with regular expressions.
library(stringr)
df$type <- as.factor(str_sub(df$cancer, 14, 17))
table(df$type)
#>
#> PR- PR
#> 8 1
Created on 2022-05-27 by the reprex package (v2.0.1)
Data
df <- data.frame(cancer = c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)",
"Breast (ER- / PR- / HER2- / AR - / EGFR -)",
"Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)",
"Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)",
"Breast (ER- / PR- / HER2- / PDL1 - / AR -)",
"Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))",
"Breast (ER- / PR- / HER2-)",
"Breast (ER- / PR- / HER2- / PD-L1 -)",
"Breast (ER- / PR / HER2 -)")
)
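Since the fixed character positions used above stop working as soon as the spacing in front of "PR" changes, here is a possible regex-based variant of the same idea in base R. Treat it as a sketch under assumptions: the fourth row is a hypothetical spacing variant that is not in the original data, and the names hits and pr_status are made up for the illustration.

```r
# Three rows from the Data section plus one hypothetical row with extra spaces
cancer <- c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)",
            "Breast (ER- / PR- / HER2-)",
            "Breast (ER- / PR / HER2 -)",
            "Breast (ER - / PR - / HER2-)")  # hypothetical spacing variant

# Find "PR", optional whitespace, then an optional "-".
# Note: regmatches() silently drops elements that have no match at all.
hits <- regmatches(cancer, regexpr("PR[[:space:]]*-?", cancer))

# Drop the whitespace so "PR -" and "PR-" end up identical
pr_status <- gsub("[[:space:]]", "", hits)

table(pr_status)
```

Because the match is located by pattern rather than by position, extra or missing spaces before "PR" no longer change the result.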
CodePudding user response:
Here is a variant of the stringr approach. First, extract the ER, PR, and HER2 components, including the sign where there is one. The line starting with across() removes any space between the marker name and its sign. Then put the three components together in a column.
library(tidyverse)
dat <- read.delim(text = text) # `text` is defined in the Data section below
dat |>
  mutate(er = str_extract(type, "ER\\s*[ -]"),
         pr = str_extract(type, "PR\\s*[ -]"),
         her2 = str_extract(type, "HER2\\s*[ -]"),
         across(c(er, pr, her2), ~ str_remove(., "\\s")),
         type1 = paste0("Breast (", er, " / ", pr, " / ", her2, ")")) |>
  select(-c(er, pr, her2))
# type type1
# Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-) Breast (ER- / PR- / HER2-)
# Breast (ER- / PR- / HER2- / AR - / EGFR -) Breast (ER- / PR- / HER2-)
# Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -) Breast (ER- / PR- / HER2-)
# Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%) Breast (ER- / PR- / HER2-)
# Breast (ER- / PR- / HER2- / PDL1 - / AR -) Breast (ER- / PR- / HER2-)
#Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver)) Breast (ER- / PR- / HER2-)
# Breast (ER- / PR- / HER2-) Breast (ER- / PR- / HER2-)
# Breast (ER- / PR- / HER2- / PD-L1 -) Breast (ER- / PR- / HER2-)
# Breast (ER- / PR / HER2 -) Breast (ER- / PR / HER2-)
Data:
text <- "type
Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)
Breast (ER- / PR- / HER2- / AR - / EGFR -)
Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)
Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)
Breast (ER- / PR- / HER2- / PDL1 - / AR -)
Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2- / PD-L1 -)
Breast (ER- / PR / HER2 -)"
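The question asks for a function, so the extract-then-clean steps above could also be wrapped into one. The following is a minimal base-R sketch, not a tested solution for the full dataset: simplify_type(), its markers argument and the helper extract_one() are hypothetical names, and it assumes every label has the form "Site (marker / marker / ...)" with each marker followed by either "-" or nothing.

```r
# Hypothetical helper: reduce each label to its ER / PR / HER2 components
simplify_type <- function(x, markers = c("ER", "PR", "HER2")) {
  extract_one <- function(marker) {
    # \\b keeps "ER" from matching inside "HER2"
    pattern <- paste0("\\b", marker, "\\s*-?")
    pos <- regexpr(pattern, x, perl = TRUE)
    out <- rep(NA_character_, length(x))   # stays NA when a marker is missing
    out[pos > 0] <- regmatches(x, pos)
    gsub("\\s", "", out, perl = TRUE)      # "HER2 -" becomes "HER2-"
  }
  # One column per marker, one row per input string
  parts <- do.call(cbind, lapply(markers, extract_one))
  # Everything before the first "(" is taken as the site, e.g. "Breast"
  site <- sub("\\s*\\(.*$", "", x, perl = TRUE)
  paste0(site, " (", apply(parts, 1, paste, collapse = " / "), ")")
}

# Three rows taken from the data above
type <- c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)",
          "Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))",
          "Breast (ER- / PR / HER2 -)")

simplified <- simplify_type(type)
simplified
table(simplified)
```

A missing marker shows up as the text "NA" in the result, which at least makes such rows easy to spot before counting.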