How to extract values between 000-500 with text as a new column?

Time:06-21

I have something like this (the real data set has 1,292,500 entries and 79 columns in total):

Code
A005
A200
B300
C001
C999
D000
D352
D480
D501
D999
E480

I need to create a new column that groups some of the codes. I was using str_extract to extract codes with a single letter; for A000-A999, for example, I used: dados$CODE_A <- str_extract(dados$CODE, "(?i)\\b(?:A)\\W*\\d ") Now I need to extract the codes between C000-C999 and D000-D499, like this:

Code CODE_X CODE_Y
A005
A200
B300
C001 C001
C999 C999
D000 D000
D352 D352
D480 D480
D501 D501
D999 D999
E480

How do I do this?

CodePudding user response:

library(stringr)
library(dplyr)
library(tidyr)

tibble(x = c("A005", "A200", "B300", "C001", "C999", "D000", "D501")) %>%
  mutate(letter = str_extract(x, "[A-Z]"),
         numbers = as.numeric(str_extract(x, "\\d{3}")),
         answer = case_when(letter == "C" ~ x,
                            letter == "D" & numbers < 500 ~ x))

  x     letter numbers answer
  <chr> <chr>    <dbl> <chr> 
1 A005  A            5 NA    
2 A200  A          200 NA    
3 B300  B          300 NA    
4 C001  C            1 C001  
5 C999  C          999 C999  
6 D000  D            0 D000  
7 D501  D          501 NA    

You could then filter with !is.na(answer), for example, to keep only the rows that matched.
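
As a rough sketch of that filtering step (the object names `codes` and `kept` are only illustrative, and the sample data is the same made-up vector as above), the pipeline could be extended like this:

```r
library(dplyr)
library(stringr)

# Hypothetical sample data; `codes` and `kept` are illustrative names
codes <- tibble(x = c("A005", "A200", "B300", "C001", "C999", "D000", "D501"))

kept <- codes %>%
  mutate(letter = str_extract(x, "[A-Z]"),                 # leading letter of the code
         numbers = as.numeric(str_extract(x, "\\d{3}")),   # three-digit numeric part
         answer = case_when(letter == "C" ~ x,             # every C code is kept
                            letter == "D" & numbers < 500 ~ x)) %>%  # D codes below 500 only
  filter(!is.na(answer))                                   # drop rows that matched neither rule

kept
```

Rows that match neither condition get NA from case_when, so the filter leaves only the C codes and the D codes below 500.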

CodePudding user response:

You could also use regex directly like this instead:

C000-C999

dados$CODE_C <- str_extract(dados$CODE, "C[0-9][0-9][0-9]")

D000-D499

dados$CODE_D <- str_extract(dados$CODE, "D[0-4][0-9][0-9]")
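
If both ranges should end up in one column, the two patterns can probably be combined with an alternation. This is only a sketch: the column name CODE_X is taken from the question, the sample data frame is made up, and the ^ and $ anchors are an added assumption that each cell holds exactly one code.

```r
library(stringr)

# Hypothetical sample data using the column name from the question
dados <- data.frame(CODE = c("A005", "C001", "C999", "D000", "D480", "D501", "E480"))

# C followed by any three digits, or D followed by 000-499;
# ^ and $ assume the whole cell is a single code
dados$CODE_X <- str_extract(dados$CODE, "^(?:C[0-9]{3}|D[0-4][0-9]{2})$")

dados
```

Codes outside both ranges (such as D501) come back as NA in CODE_X.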