How to capture (0/32LK), (0/21x) and (3/17) in one regular expression

I want to clean up some TNM entries. Here is an example:

df <- structure(list(TNM = c("pT3 N0 (0/13)", "pT3 N2b (21/45l)", "pT3 N0 (0/32 LK)"
)), class = "data.frame", row.names = c(NA, -3L))

               TNM
1    pT3 N0 (0/13)
2 pT3 N2b (21/45l)
3 pT3 N0 (0/32 LK)

So far I got this:

library(dplyr)
library(stringr)

df %>% 
  mutate(TNM = str_remove_all(TNM, '\\,|\\;|\\.'),
         TNM = str_replace_all(TNM, ' ', ''),
         TNM = str_replace_all(TNM, "x", "X")) %>% 
  mutate(N_count = str_extract(TNM, '\\(\\d+\\/\\d+\\)'))

             TNM N_count
1    pT3N0(0/13)  (0/13)
2 pT3N2b(21/45l)    <NA>
3  pT3N0(0/32LK)    <NA>

This works:

library(dplyr)
library(stringr)

df %>% 
  mutate(TNM = str_remove_all(TNM, '\\,|\\;|\\.'),
         TNM = str_replace_all(TNM, ' ', ''),
         TNM = str_replace_all(TNM, "x", "X")) %>% 
  mutate(N_count = str_extract(TNM, '\\(\\d+\\/\\d+\\)|\\(\\d+\\/\\d+\\w\\)|\\(\\d+\\/\\d+\\w+\\)'))

             TNM  N_count
1    pT3N0(0/13)   (0/13)
2 pT3N2b(21/45l) (21/45l)
3  pT3N0(0/32LK) (0/32LK)

Is there a way to shorten this regex: '\\(\\d+\\/\\d+\\)|\\(\\d+\\/\\d+\\w\\)|\\(\\d+\\/\\d+\\w+\\)'?

CodePudding user response：

The three alternatives differ only in what follows the second number: no word characters, a single one, or one or more.

You can shorten the pattern by dropping the alternation and matching zero or more word characters with \\w* instead:

\\(\\d+/\\d+\\w*\\)
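As a quick check, here is a minimal sketch that applies the shortened pattern to the three cleaned values from the question (spaces already removed). The vector names tnm and n_count are only illustrative.

```r
library(stringr)

# Values as they look after the question's cleanup step (no spaces left)
tnm <- c("pT3N0(0/13)", "pT3N2b(21/45l)", "pT3N0(0/32LK)")

# \\w* allows zero or more word characters between the second number and ")"
n_count <- str_extract(tnm, "\\(\\d+/\\d+\\w*\\)")
print(n_count)
```

The forward slash needs no escaping in stringr's regex flavor, so \\/ and / behave the same here.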


If the spaces are not removed first and you want to match (0/32 LK) but not a bare trailing space as in (21/45 ), you can optionally match whitespace characters followed by one or more word characters:

\\(\\d+/\\d+(?:\\s*\\w+)?\\)
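A small sketch of how this variant behaves on the original, uncleaned strings. The fourth input, with only a trailing space inside the parentheses, is a made-up value added to show the non-matching case; the names raw, pattern and n_count are illustrative.

```r
library(stringr)

# First three values come from the question; the last one is a hypothetical
# entry with only a trailing space inside the parentheses
raw <- c("pT3 N0 (0/13)", "pT3 N2b (21/45l)", "pT3 N0 (0/32 LK)", "pT3 N1 (21/45 )")

# The optional group (?:\\s*\\w+)? requires at least one word character
# whenever anything follows the second number
pattern <- "\\(\\d+/\\d+(?:\\s*\\w+)?\\)"

n_count <- str_extract(raw, pattern)
print(n_count)
```

The first three entries are extracted with their inner spacing intact, while the entry ending in a bare space yields NA.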
