extract tidyr function with Regex: bracket and commas as separator

Time:01-18

Here is my dataset:

df = data.frame(x=c(NA, "xxdsa[1,d]", "x[a,3]", "x2[a,d]", "x4[a,4]"))

df:

           x
1       <NA>
2 xxdsa[1,d]
3     x[a,3]
4    x2[a,d]
5    x4[a,4]

I want to separate the x column into 3 columns, with either the bracket or the comma as separators. The result should be:

      A    B    C
1  <NA> <NA> <NA>
2 xxdsa    1    d
3     x    a    3
4    x2    a    d
5    x4    a    4

I tried the following code but do not understand why it's not working. I really want to use the extract function from tidyr as it is quite fast (compared to the separate function for example).

df %>% tidyr::extract(x, c("A", "B","C"), "([[a-zA-Z0-9]]+)\\[([[a-zA-Z0-9]]+)\\,([[a-zA-Z0-9]]+)\\]")

CodePudding user response:

The regex needs to describe the whole string. In the code below, starting from the beginning of the string (^), capture one or more characters that are not an opening square bracket (([^\\[]+)), followed by the opening square bracket (\\[, escaped because it is a metacharacter). Then capture a second group of one or more characters that are not a comma (([^,]+)), followed by the comma, and finally a third group of one or more characters that are not a closing bracket (([^\\]]+)), followed by the closing bracket (\\]).

library(tidyr)
extract(df, x, into = c("A", "B", "C"), "^([^\\[]+)\\[([^,]+),([^\\]]+)\\]")

-output

      A    B    C
1  <NA> <NA> <NA>
2 xxdsa    1    d
3     x    a    3
4    x2    a    d
5    x4    a    4
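
To see what each capture group picks up without involving tidyr, the same pattern can be checked in base R. This is only a sketch: it swaps extract() for regexec() and regmatches(), leaves out the NA row for simplicity, and the variable names (pattern, x, m, groups) are illustrative.

```r
# Same pattern as in the extract() call above, checked with base R only.
pattern <- "^([^\\[]+)\\[([^,]+),([^\\]]+)\\]"
x <- c("xxdsa[1,d]", "x[a,3]", "x2[a,d]", "x4[a,4]")

# regexec() locates the full match and each capture group;
# regmatches() converts those positions into the matched substrings.
m <- regmatches(x, regexec(pattern, x))

# Drop the first element (the whole match) and keep the three groups per row.
groups <- do.call(rbind, lapply(m, function(p) p[-1]))
colnames(groups) <- c("A", "B", "C")
print(groups)
```

Each row of groups should correspond to one row of the A, B, C output shown above.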

In the OP's code, the fix is to move a-zA-Z0-9 out of the double brackets [[ ]] and into a single pair of brackets [ ]:

df %>% 
  tidyr::extract(x, c("A", "B","C"), 
   "^([a-zA-Z0-9]+)\\[([a-zA-Z0-9]+)\\,([a-zA-Z0-9]+)\\]")
      A    B    C
1  <NA> <NA> <NA>
2 xxdsa    1    d
3     x    a    3
4    x2    a    d
5    x4    a    4

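According to ?regex, '[[:alnum:]]' means '[0-9A-Za-z]'. The double brackets are only needed for named POSIX classes such as [:alnum:]; a plain range like a-zA-Z0-9 goes inside a single pair of brackets.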
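
Why the double-bracket version fails can be seen with a small base R check. As a sketch, and assuming the regex engine follows the usual POSIX/PCRE bracket rules (as base R's grepl() does), [[a-zA-Z0-9]] is read as the class [[a-zA-Z0-9] (which also contains a literal [) followed by a literal ], so the + applies to the ] rather than to the class. The names broken, fixed, and posix are illustrative.

```r
# "[[a-zA-Z0-9]]+" parses as: one character from the set { [, a-z, A-Z, 0-9 },
# followed by one or more literal "]" characters.
broken <- "^[[a-zA-Z0-9]]+$"
fixed  <- "^[a-zA-Z0-9]+$"
posix  <- "^[[:alnum:]]+$"

grepl(broken, "xxdsa")  # FALSE: needs a single allowed character followed by "]"
grepl(broken, "a]]")    # TRUE: "a" from the set, then "]]"
grepl(fixed, "xxdsa")   # TRUE
grepl(posix, "xxdsa")   # TRUE
```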

CodePudding user response:

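I tried a different regular expression; please check whether it works for you. With remove = FALSE the original x column is kept alongside the new ones.

tidyr::extract(df, x, into = c('A','B','C'), regex = '(\\w.*)\\[(.*)\\,(.*)\\]', remove = FALSE)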