I have something like this (the real data has 1,292,500 rows and 79 columns):
Code |
---|
A005 |
A200 |
B300 |
C001 |
C999 |
D000 |
D352 |
D480 |
D501 |
D999 |
E480 |
I need to create a new column that groups some of the codes. I was using str_extract
to extract codes with a single letter; for example, for A000-A999 I used:
dados$CODE_A <- str_extract(dados$CODE, "(?i)\\b(?:A)\\W*\\d+")
But now I need to extract the codes in the ranges C000-C999 and D000-D499, like this:
Code | CODE_X | CODE_Y |
---|---|---|
A005 | | |
A200 | | |
B300 | | |
C001 | C001 | |
C999 | C999 | |
D000 | D000 | |
D352 | D352 | |
D480 | D480 | |
D501 | D501 | |
D999 | D999 | |
E480 | | |
How do I do this?
CodePudding user response:
library(stringr)
library(dplyr)
library(tidyr)
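tibble(x = c("A005", "A200", "B300", "C001", "C999", "D000", "D501")) %>%
  mutate(letter = str_extract(x, "[A-Z]"),                       # leading letter of the code
         numbers = as.numeric(str_extract(x, "\\d{3}")),         # three-digit numeric part
         answer = case_when(letter == "C" ~ x,                   # keep every C000-C999
                            letter == "D" & numbers < 500 ~ x))  # keep D000-D499; all other rows become NA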
x letter numbers answer
<chr> <chr> <dbl> <chr>
1 A005 A 5 NA
2 A200 A 200 NA
3 B300 B 300 NA
4 C001 C 1 C001
5 C999 C 999 C999
6 D000 D 0 D000
7 D501 D 501 NA
You could then filter on !is.na(answer), for example, to keep only the rows that fall in the wanted ranges.
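As a rough sketch of that filtering step applied to a data frame shaped like the one in the question: the small `dados` below is a made-up stand-in for the real table, and `resultado` is just an illustrative name.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the real `dados` (which has ~1.3M rows and 79 columns)
dados <- tibble(CODE = c("A005", "C001", "C999", "D000", "D480", "D501", "E480"))

resultado <- dados %>%
  mutate(
    letter  = str_extract(CODE, "[A-Z]"),               # leading letter
    numbers = as.numeric(str_extract(CODE, "\\d{3}")),  # numeric part
    answer  = case_when(
      letter == "C" ~ CODE,                  # C000-C999
      letter == "D" & numbers < 500 ~ CODE   # D000-D499
    )                                        # unmatched rows get NA
  ) %>%
  filter(!is.na(answer))                     # drop rows outside both ranges
```

If you want to keep all rows and only add the new column, leave off the final filter() call.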
CodePudding user response:
You could also use regex directly like this instead:
# C000-C999
dados$CODE_C <- str_extract(dados$CODE, "C[0-9][0-9][0-9]")
# D000-D499
dados$CODE_D <- str_extract(dados$CODE, "D[0-4][0-9][0-9]")
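Unanchored patterns like these match anywhere in the string, so a value such as "XC0011" would still yield "C001". If the column can contain anything other than clean four-character codes, one possible variant is to anchor the pattern and combine both ranges into a single column. This is only a sketch: the `codes` vector and the `code_x` name are made up for illustration.

```r
library(stringr)

# Made-up sample values, including one malformed entry ("XC0011")
codes <- c("A005", "C001", "C999", "D000", "D480", "D501", "D999", "E480", "XC0011")

# ^ and $ force the whole string to match: C000-C999 or D000-D499
pattern <- "^(?:C\\d{3}|D[0-4]\\d{2})$"

# str_extract() returns NA for values that do not match
code_x <- str_extract(codes, pattern)
```

With the real data this would be something like dados$CODE_X <- str_extract(dados$CODE, pattern).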