Home > Net >  How replace all cases in columns with NA if there are more than x numbers OR more than x letters in
How replace all cases in columns with NA if there are more than x numbers OR more than x letters in

Time:03-05

I have data that looks a bit like this

col1   col2
1      "1042AZ"
2      "9523 pa"
3      "dog"
4      "New York"
5      "20000 (usa)"
6      "Outside the country"
7      "1052"

I want to keep everything that

  • is only 4 numbers
  • is only 4 numbers and two letters with any combination of spaces

I currently have this code:

df$col2 <- gsub('\\s ', '', df$col2)
df$col2 <- toupper(df$col2)
#Delete all rows that does not start with 4 numbers and make PC4 column
df <- df %>% 
  mutate(col3 = str_extract(col2, "^[0-9]{4,}"), 
         col4 = str_extract(col2, "[A-Z].*$"),
         across(c(col2,col3,col4), ~ifelse(grepl("^[0-9]{4}", col2), .x, "")))

I want this result:

col1    col2       col3   col4
1       "1042AZ"   1042   "AZ"
2       "9523PA"   9523   "PA"
3       NA         NA     NA
4       NA         NA     NA
5       NA         NA     NA
6       NA         NA     NA
7       "1052"     1052   NA

Problem is that the number in line 5 stays after running my code.

CodePudding user response:

Following your code, you can set to NA if col3 does not have 4 characters:

df %>% 
  mutate(col2 = gsub('\\s ', '', toupper(col2)),
         col3 = str_extract(col2, "^[0-9]{4,}"), 
         col4 = str_extract(col2, "[A-Z|a-z].*$"),
         across(c(col2,col3,col4), ~ ifelse(nchar(col3) == 4, .x, NA)))

  col1   col2 col3 col4
1    1 1042AZ 1042   AZ
2    2 9523PA 9523   PA
3    3   <NA> <NA> <NA>
4    4   <NA> <NA> <NA>
5    5   <NA> <NA> <NA>
6    6   <NA> <NA> <NA>
7    7   1052 1052 <NA>

data

df <- read.table(header = T, text = 'col1   col2
1      "1042AZ"
2      "9523 pa"
3      "dog"
4      "New York"
5      "20000 (usa)"
6      "Outside the country"
7      "1052"')
  • Related