str_extract_all syntax

I need some help with stringr::str_extract_all

x is the name of my data frame.

V1
(A_K9B,A_K9one,A_K9two,B_U10J)

x = x %>% 
  mutate(N_alph = map_chr(str_extract_all(x$V1, 'A_([A-Z][0-10])[A-Z]'), toString))
x = x %>% 
  mutate(N_.1 = map_chr(str_extract_all(x$V1, 'A_([A-Z][0-10])[o][n][e]'), toString))
x = x %>% 
  mutate(N_.2 = map_chr(str_extract_all(x$V1, 'A_([A-Z][0-10])[t][w][o]'), toString))

This is my current output:

V1                                N_alph  N_.1     N_.2
(A_K9B,A_K9one,A_K9two,B_U10J)   A_K9B   A_K9one  A_K9two

I am fine with the N_alph column as it is; I want it kept separate from the other two. Ideally, though, I would like to avoid typing [o][n][e] and [t][w][o] for the values that end in a word rather than a single letter. If I use:

x = x %>% 
  mutate(N_alph = map_chr(str_extract_all(x$V1, 'A_([A-Z][0-10])[A-Z]'), toString))
x = x %>% 
  mutate(N_all.words = map_chr(str_extract_all(x$V1, 'A_([A-Z][0-10])[\\w+]'), toString))

Output is:

V1                                N_alph  N_all.words    
(A_K9B,A_K9one,A_K9two,B_U10J)   A_K9B   A_K9B,A_K9o,A_K9t

Desired output would be

V1                                N_alph  N_all.words    
(A_K9B,A_K9one,A_K9two,B_U10J)   A_K9B   A_K9one,A_K9two

CodePudding user response：

When you use metacharacters like \w, \b, \s, etc., you don't need the square brackets. If you do use square brackets, then the + would need to be outside them, because inside a character class it is just a literal character. Also, the number group should be [0-9], since a character class matches individual characters, not combinations of characters; [0-10] matches only the characters 0 and 1. To take into account numbers higher than 9, we just expand the number of times we check for the group with {} brackets, or simply use the + operator. The final result looks like so:

x %>% 
  mutate(N_all.words = str_extract_all(V1, 'A_([A-Z][0-9]{1,2})\\w+'))

Resulting in:

                              V1             N_all.words
1 (A_K9B,A_K9one,A_K9two,B_U10J) A_K9B, A_K9one, A_K9two
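
To see why each of those changes matters, here is a small sketch that runs the patterns on plain strings. The string "K9 U10 Z0" and the variable names are made up for illustration; only the V1 value comes from the question.

```r
library(stringr)

# [0-10] is a character class containing only "0" and "1",
# so "K9" is missed and "U10" is cut short.
digits_class <- str_extract_all("K9 U10 Z0", "[A-Z][0-10]")[[1]]
# "U1" "Z0"

# [0-9]{1,2} matches one or two digits.
digits_quant <- str_extract_all("K9 U10 Z0", "[A-Z][0-9]{1,2}")[[1]]
# "K9" "U10" "Z0"

v <- "(A_K9B,A_K9one,A_K9two,B_U10J)"

# Inside the brackets, + is a literal, so only one character is matched.
plus_inside <- str_extract_all(v, "A_[A-Z][0-9]{1,2}[\\w+]")[[1]]
# "A_K9B" "A_K9o" "A_K9t"

# Outside the brackets, + repeats \w, so the whole word is matched.
plus_outside <- str_extract_all(v, "A_[A-Z][0-9]{1,2}\\w+")[[1]]
# "A_K9B" "A_K9one" "A_K9two"
```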

I also created a version that I found a little tidier:

x %>% 
  mutate(N_all.words = str_extract_all(V1, 'A_\\w\\d{1,2}\\w+'))
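
Both patterns above still pick up A_K9B, whereas the desired output in the question lists only A_K9one and A_K9two. One possible way to separate the two cases is sketched below. It assumes, based only on the sample row, that single-letter suffixes are uppercase and word suffixes are lowercase; if your real data does not follow that convention, the patterns will need adjusting.

```r
library(dplyr)
library(purrr)
library(stringr)

# Sample data rebuilt from the question.
x <- data.frame(V1 = "(A_K9B,A_K9one,A_K9two,B_U10J)")

x <- x %>%
  mutate(
    # One uppercase letter after the digits (assumed convention).
    N_alph = map_chr(str_extract_all(V1, "A_[A-Z][0-9]{1,2}[A-Z]"), toString),
    # One or more lowercase letters after the digits (assumed convention).
    N_all.words = map_chr(str_extract_all(V1, "A_[A-Z][0-9]{1,2}[a-z]+"), toString)
  )
```

Note that toString() joins the matches with a comma followed by a space, so N_all.words here is "A_K9one, A_K9two".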