Here is my dataset:
df = data.frame(x=c(NA, "xxdsa[1,d]", "x[a,3]", "x2[a,d]", "x4[a,4]"))
df:
x
1 <NA>
2 xxdsa[1,d]
3 x[a,3]
4 x2[a,d]
5 x4[a,4]
I want to separate the x column into 3 columns, with either the bracket or the comma as separators. The result should be:
A B C
1 <NA> <NA> <NA>
2 xxdsa 1 d
3 x a 3
4 x2 a d
5 x4 a 4
I tried the following code but do not understand why it's not working. I really want to use the extract function from tidyr as it is quite fast (compared to the separate function for example).
df %>% tidyr::extract(x, c("A", "B","C"), "([[a-zA-Z0-9]]+)\\[([[a-zA-Z0-9]]+)\\,([[a-zA-Z0-9]]+)\\]")
CodePudding user response:
The regex needs to describe the whole string correctly. In the code below, starting from the beginning of the string (^), we capture one or more characters that are not an opening square bracket (([^\\[]+)), followed by the opening square bracket (\\[, escaped because it is a metacharacter). The second group captures one or more characters that are not a comma (([^,]+)), followed by the , itself. The last group captures one or more characters that are not a closing bracket (([^\\]]+)), followed by the closing bracket (\\]).
library(tidyr)
extract(df, x, into = c("A", "B", "C"), "^([^\\[]+)\\[([^,]+),([^\\]]+)\\]")
-output
A B C
1 <NA> <NA> <NA>
2 xxdsa 1 d
3 x a 3
4 x2 a d
5 x4 a 4
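In the OP's code, just take the a-zA-Z0-9 out of the double brackets [[ ]] and place it inside a single pair [ ]. With the double brackets, [[a-zA-Z0-9]]+ is read as the class [[a-zA-Z0-9] (which also contains a literal [) followed by one or more literal ], so nothing matches.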
df %>%
tidyr::extract(x, c("A", "B","C"),
"^([a-zA-Z0-9]+)\\[([a-zA-Z0-9]+)\\,([a-zA-Z0-9]+)\\]")
A B C
1 <NA> <NA> <NA>
2 xxdsa 1 d
3 x a 3
4 x2 a d
5 x4 a 4
According to ?regex, ‘[[:alnum:]]’ means ‘[0-9A-Za-z]’, so double brackets are only needed around a named POSIX class such as [:alnum:].
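For completeness, here is a minimal sketch of the same call written with the POSIX class instead of the explicit ranges. It assumes the df from the question and that tidyr is installed; the name res is only used for illustration.

```r
library(tidyr)

# Data from the question
df <- data.frame(x = c(NA, "xxdsa[1,d]", "x[a,3]", "x2[a,d]", "x4[a,4]"))

# [:alnum:] is a named class, so it has to sit inside a bracket expression:
# [[:alnum:]]+ means "one or more alphanumeric characters"
res <- extract(df, x, into = c("A", "B", "C"),
               regex = "^([[:alnum:]]+)\\[([[:alnum:]]+),([[:alnum:]]+)\\]")
res
```

This should give the same three columns as the a-zA-Z0-9 version for this data, with the NA row left as NA.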
CodePudding user response:
I tried a different regular expression; please check whether it works for your data:
tidyr::extract(df, x, into = c('A','B','C'), regex = '(\\w.*)\\[(.*)\\,(.*)\\]', remove = FALSE)
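One thing to keep in mind with the greedy .* groups: they behave the same as the negated-class pattern on the data in the question, but can split differently on other inputs. The sketch below uses a hypothetical string with two commas (not part of the original data) to compare the two patterns from this thread; tmp, greedy and negated are illustrative names only.

```r
library(tidyr)

# Hypothetical input with two commas inside the brackets (not in the question's data)
tmp <- data.frame(x = "x[a,b,c]")

# Greedy groups: the second (.*) runs up to the LAST comma
greedy <- extract(tmp, x, into = c("A", "B", "C"),
                  regex = "(\\w.*)\\[(.*)\\,(.*)\\]")

# Negated classes: the second group stops at the FIRST comma
negated <- extract(tmp, x, into = c("A", "B", "C"),
                   regex = "^([^\\[]+)\\[([^,]+),([^\\]]+)\\]")

greedy   # B is "a,b", C is "c"
negated  # B is "a",   C is "b,c"
```

If the values inside the brackets never contain extra commas or brackets, as in the question, either pattern should be fine.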