How to delete text that does not start with a certain amount of numbers in dataframe column-CodePudding

I have this:

col1
1234HO
9535KU
4532SP
1
hello
xyz
1206
9530OK
23
8524US

And I need it to be this:

col1     col2    col3
1234HO   1234    HO
9535KU   9535    KU
4532SP   4532    SP
                       #these rows still need to be there


1206     1206          #keep in mind that I still want to keep this if there is 4 numbers
9530OK   9530    OK

8524US   8524    US

I tried removing it manually, but it's a bit too much work. I am not sure how to make a function that says "delete all text that does not start with 4 numbers". I would only know how to do it if they were all the same numbers, but they can be any numbers.

CodePudding user response：

Here is a dplyr approach with very low level of regex.

Input

# your df
df <- read.table(header = T, text = "
col1
1234HO
9535KU
4532SP
1
hello
xyz
1206
9530OK
23
8524US")

Empty rows

library(dplyr)

df %>% mutate(col2 = str_extract(col1, "^[0-9]{4,}"), 
              col3 = str_extract(col1, "[A-Z].*$"),
              col3 = replace_na(col3, ""),
              across(everything(), ~ifelse(grepl("^[0-9]{4}", col1), .x, print(""))))

     col1 col2 col3
1  1234HO 1234   HO
2  9535KU 9535   KU
3  4532SP 4532   SP
4                  
5                  
6                  
7    1206 1206     
8  9530OK 9530   OK
9                  
10 8524US 8524   US

Rows as NA

# if you want them to be filled with NA
df %>% mutate(col2 = str_extract(col1, "^[0-9]{4,}"), 
              col3 = str_extract(col1, "[A-Z].*$"),
              across(everything(), ~ifelse(grepl("^[0-9]{4}", col1), .x, NA)))

     col1 col2 col3
1  1234HO 1234   HO
2  9535KU 9535   KU
3  4532SP 4532   SP
4    <NA> <NA> <NA>
5    <NA> <NA> <NA>
6    <NA> <NA> <NA>
7    1206 1206 <NA>
8  9530OK 9530   OK
9    <NA> <NA> <NA>
10 8524US 8524   US

CodePudding user response：

You can use tidyr::separate and then filter.

library(dplyr)
library(tidyr)

dat %>%
  separate(col1, into = c("num", "text"), sep = "(?<=[0-9])(?=[A-Za-z])", remove = F) %>% 
  filter(!grepl("[A-Za-z]", num) & nchar(num) > 3)

    col1  num text
1 1234HO 1234   HO
2 9535KU 9535   KU
3 4532SP 4532   SP
4   1206 1206 <NA>
5 9530OK 9530   OK
6 8524US 8524   US

CodePudding user response：

Another possible solution:

library(tidyverse)

df <- data.frame(
  stringsAsFactors = FALSE,
  col1 = c("1234HO","9535KU",
           "4532SP","1","hello","xyz","1206","9530OK","23",
           "8524US")
)

df %>% 
  separate(col1, into=str_c("col", 2:3), sep="(?<=\\d{4})",
     remove = F, fill = "right") %>% filter(!is.na(col3))

#>     col1 col2 col3
#> 1 1234HO 1234   HO
#> 2 9535KU 9535   KU
#> 3 4532SP 4532   SP
#> 4   1206 1206     
#> 5 9530OK 9530   OK
#> 6 8524US 8524   US