I have data that looks a bit like this
col1 col2
1 "1042AZ"
2 "9523 pa"
3 "dog"
4 "New York"
5 "20000 (usa)"
6 "Outside the country"
7 "1052"
I want to keep everything that
- is only 4 numbers
- is only 4 numbers and two letters with any combination of spaces
I currently have this code:
df$col2 <- gsub('\\s ', '', df$col2)
df$col2 <- toupper(df$col2)
#Delete all rows that does not start with 4 numbers and make PC4 column
df <- df %>%
mutate(col3 = str_extract(col2, "^[0-9]{4,}"),
col4 = str_extract(col2, "[A-Z].*$"),
across(c(col2,col3,col4), ~ifelse(grepl("^[0-9]{4}", col2), .x, "")))
I want this result:
col1 col2 col3 col4
1 "1042AZ" 1042 "AZ"
2 "9523PA" 9523 "PA"
3 NA NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA
7 "1052" 1052 NA
Problem is that the number in line 5 stays after running my code.
CodePudding user response:
Following your code, you can set to NA
if col3
does not have 4 characters:
df %>%
mutate(col2 = gsub('\\s ', '', toupper(col2)),
col3 = str_extract(col2, "^[0-9]{4,}"),
col4 = str_extract(col2, "[A-Z|a-z].*$"),
across(c(col2,col3,col4), ~ ifelse(nchar(col3) == 4, .x, NA)))
col1 col2 col3 col4
1 1 1042AZ 1042 AZ
2 2 9523PA 9523 PA
3 3 <NA> <NA> <NA>
4 4 <NA> <NA> <NA>
5 5 <NA> <NA> <NA>
6 6 <NA> <NA> <NA>
7 7 1052 1052 <NA>
data
df <- read.table(header = T, text = 'col1 col2
1 "1042AZ"
2 "9523 pa"
3 "dog"
4 "New York"
5 "20000 (usa)"
6 "Outside the country"
7 "1052"')