R: Using regular expression to keep rows of data with 6 digits

mydat <- data.frame(id = c("372303", "KN5232", "231244", "283472-3822"),
                    name = c("Adam", "Jane", "TJ", "Joyce"))
> mydat
           id  name
1       372303  Adam
2       KN5232  Jane
3       231244    TJ
4  283472-3822 Joyce

In my dataset, I want to keep the rows where id is a 6-digit number. For ids that consist of a 6-digit number followed by - and a 4-digit number, I want to keep only the first 6 digits.

My final data should look like this:

> mydat2
      id  name
1 372303  Adam
3 231244    TJ
4 283472 Joyce

I am using grep("^[0-9]{6}$", c("372303", "KN5232", "231244", "283472-3822")), but this only matches ids that are exactly 6 digits, so it does not handle the case where I want to keep only the first 6 digits before the -.

CodePudding user response:

One method is to split id at - into separate rows with separate_rows, and then keep only the rows that are exactly 6 digits with filter (or subset).

library(dplyr)
library(tidyr)
library(stringr)
mydat %>% 
  separate_rows(id, sep = "-") %>% 
  filter(str_detect(id, '^\\d{6}$'))

-output

# A tibble: 3 × 2
  id     name 
  <chr>  <chr>
1 372303 Adam 
2 231244 TJ   
3 283472 Joyce

CodePudding user response:

You can extract the first standalone 6-digit number from each id and then keep only the rows where a 6-digit code was found:

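library(stringr)
mydat <- data.frame(id = c("372303", "KN5232", "231244", "283472-3822"),
                    name = c("Adam", "Jane", "TJ", "Joyce"))
# Pull out the first standalone 6-digit run; ids without one become NA
mydat$id <- str_extract(mydat$id, "\\b\\d{6}\\b")
# Keep rows whose extracted id is exactly 6 digits (grepl() is FALSE for NA)
mydat[grepl("^\\d{6}$", mydat$id), ]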

Output:

      id  name
1 372303  Adam
3 231244    TJ
4 283472 Joyce

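The pattern \b\d{6}\b matches a 6-digit code only when it stands alone, because \b is a word boundary: the six digits may not be directly preceded or followed by another digit, letter, or underscore.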
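
To see what the word boundaries do in practice, here is a small base-R sketch (no stringr needed). The ids vector is the one from the question plus a made-up 7-digit value, "1234567", added only to show a case the boundaries reject; the variable names are illustrative.

```r
# Illustrative ids: the question's values plus a made-up 7-digit one
ids <- c("372303", "KN5232", "231244", "283472-3822", "1234567")

# Position of the first standalone 6-digit run in each id (-1 means no match)
m <- regexpr("\\b\\d{6}\\b", ids, perl = TRUE)

# regmatches() returns only the matched pieces, so place them into an
# all-NA vector at the positions that actually matched
codes <- rep(NA_character_, length(ids))
codes[m > 0] <- regmatches(ids, m)

print(codes)
```

Here codes should hold "372303", NA, "231244", "283472", NA: the 7-digit value is rejected because there is no word boundary after its sixth digit, and "KN5232" has no run of six digits at all.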

CodePudding user response:

You could also extract the first run of 6 digits with a very simple regex (\\d{6}), convert it to numeric (as I expect you would anyway), and remove the NAs. str_extract returns one value per id, with NA where nothing matches, so the result converts cleanly.

E.g.

library(dplyr)
library(stringr)

mydat |>
  mutate(id = as.numeric(str_extract(id, "\\d{6}"))) |>
  na.omit()

Output:

      id  name
1 372303  Adam
3 231244    TJ
4 283472 Joyce
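
Note that a bare \d{6} also matches the first six digits of a longer number. If the only ids worth keeping are exactly 6 digits, optionally followed by - and 4 digits, a stricter option is to anchor the whole pattern. A minimal base-R sketch under that assumption (the names pattern, keep and mydat2 are just illustrative):

```r
mydat <- data.frame(id = c("372303", "KN5232", "231244", "283472-3822"),
                    name = c("Adam", "Jane", "TJ", "Joyce"),
                    stringsAsFactors = FALSE)

# Whole id must be 6 digits, optionally followed by "-" and 4 digits
pattern <- "^(\\d{6})(-\\d{4})?$"

# Keep only the rows whose id has one of the two accepted shapes
keep <- grepl(pattern, mydat$id, perl = TRUE)
mydat2 <- mydat[keep, ]

# Replace each kept id with its first capture group (the 6 digits)
mydat2$id <- sub(pattern, "\\1", mydat2$id, perl = TRUE)

print(mydat2)
```

This keeps the ids as character strings; wrap the result in as.numeric() if you want numbers, bearing in mind that any leading zeros would be lost.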