Find a date with grepl in R

Time:11-07

I am working in R where I have a dataframe with one character variable V1 that contains a lot of different strings. I want to find dates in format dd.mm.yy or dd.mm.yyyy in this column using grepl. My dataset includes, among other things

V1
00101230311200022
73.11.22 15:19

My date is 73.11.22. Of course it should be 03.11.22 or something like that, so I would like to first extract it and then get an error message because it’s an incorrect date.

I tried:

grepl("[0-9]{2}.[0-9]{2}.[0-9]{2}", x = df[,1])

but it returns TRUE for both rows. Thanks for any help.

CodePudding user response:

You can check for a date with a regex, or you could try to parse the date:

dat <- tibble::tibble(V1 = c("00101230311200022", "73.11.22 15:19"))

#full regex (check for the entire date-time sequence)
grepl("\\d{2}\\.\\d{1,2}\\.\\d{1,2}\\s\\d{1,2}\\:\\d{1,2}", dat$V1)
#> [1] FALSE  TRUE

#parse the date and check if it works
!is.na(lubridate::parse_date_time(dat$V1, orders = "ymd HM", quiet = TRUE) )
#> [1] FALSE  TRUE
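
Note that the parse above succeeds for 73.11.22 only because orders = "ymd HM" reads the string as year.month.day. A minimal base-R sketch of the difference, assuming the intended layout is dd.mm.yy as stated in the question (the variable names are illustrative only):

```r
x <- "73.11.22 15:19"

# Read as year.month.day: 73 is taken as the year, 22 as the day.
# Trailing characters after the matched format are ignored by as.Date().
as_ymd <- as.Date(x, format = "%y.%m.%d")

# Read as day.month.year: 73 is not a valid day, so the result is NA
as_dmy <- as.Date(x, format = "%d.%m.%y")

print(as_ymd)
print(as_dmy)
```

So whether the second row counts as a valid date depends entirely on which field order you ask the parser to assume.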

CodePudding user response:

An unescaped . in a regex matches any character, so it has to be escaped as \\. to match a literal dot. Try this instead:

dates <- c("00101230311200022", 
           "73.11.22 15:19", 
           "03.11.2022 15:19", 
           "73.11.22222 15:19", 
           "03.11.2022")

grepl("^\\d{2}\\.\\d{2}\\.(\\d{2}|\\d{4})\\b", dates)
#> [1] FALSE  TRUE  TRUE FALSE  TRUE
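
To see why the pattern from the question matched both rows, here is a small base-R sketch comparing the unescaped and the escaped dot on the two example strings (variable names are illustrative only):

```r
x <- c("00101230311200022", "73.11.22 15:19")

# Unescaped: "." matches any character, so eight consecutive digits match too
unescaped <- grepl("[0-9]{2}.[0-9]{2}.[0-9]{2}", x)

# Escaped: "\\." matches only a literal dot
escaped <- grepl("[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}", x)

# fixed = TRUE disables regex syntax entirely: is there a literal dot at all?
literal <- grepl(".", x, fixed = TRUE)

print(unescaped)
print(escaped)
print(literal)
```

Only the unescaped version reports a match for the all-digit string.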

That said, is there a reason not to use a well-tested date-time parser? E.g. lubridate:

lubridate::parse_date_time(dates, orders = c("dmy HM", "dmy"))
#> Warning: 3 failed to parse.
#> [1] NA                        NA                       
#> [3] "2022-11-03 15:19:00 UTC" NA                       
#> [5] "2022-11-03 00:00:00 UTC"

Created on 2022-11-06 with reprex v2.0.2
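
The question asks for an error message when the date is incorrect, whereas a parser returns NA. A minimal base-R sketch of turning those NAs into an error follows; check_dmy is a hypothetical helper name, not part of any package, and it assumes the input has already been reduced to the bare dd.mm.yy or dd.mm.yyyy token:

```r
# Hypothetical helper: parse dd.mm.yy / dd.mm.yyyy strings with base R
# and stop with an error if any of them is not a valid date.
check_dmy <- function(x) {
  # Choose the format by the length of the year part; mixing them up
  # silently gives wrong years ("%Y" would read "22" as the year 22).
  four <- grepl("^\\d{2}\\.\\d{2}\\.\\d{4}$", x)
  two  <- grepl("^\\d{2}\\.\\d{2}\\.\\d{2}$", x)

  out <- rep(as.Date(NA), length(x))
  out[four] <- as.Date(x[four], format = "%d.%m.%Y")
  out[two]  <- as.Date(x[two],  format = "%d.%m.%y")

  # Anything still NA either had the wrong shape or an impossible day/month
  bad <- is.na(out)
  if (any(bad)) {
    stop("invalid date(s): ", paste(x[bad], collapse = ", "))
  }
  out
}

ok <- check_dmy(c("03.11.22", "03.11.2022"))
print(ok)

# Capture the error message instead of aborting, for demonstration
msg <- tryCatch(check_dmy(c("03.11.22", "73.11.22")),
                error = function(e) conditionMessage(e))
print(msg)
```

Whether to stop, warn, or just keep the NA is a design choice; stopping matches what the question asks for, but keeping NA is usually easier to work with inside a data frame.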

CodePudding user response:

Read up on what grepl() does. It returns a logical vector with one element per input string, TRUE where the pattern is found (grep() is the variant that returns the positions). Both of your strings match the pattern, because the unescaped . matches any character, so a hit for both rows is the expected return value.

You have a string that has the date buried somewhere inside. If the layout is consistent, the date is in the 2nd position, so you could split the strings to extract it. Note that other approaches or more explicit patterns may be needed to extract the dates if V1 is not as well formatted as in the example.

library(tibble)
library(dplyr)
library(stringr)

# simulate your data frame with V1 variable
df <- tibble(V1 = c("00101230311200022 73.11.22 15:19", "00101230311200022 33.11.22 15:19", "00101230311200022 03.11.22 15:19"))

# split character string and extract 2nd part
df %>% 
  mutate(DATE = str_split(V1, pattern = " ") %>% sapply("[",2)) 

# A tibble: 3 × 2
  V1                               DATE    
  <chr>                            <chr>   
1 00101230311200022 73.11.22 15:19 73.11.22
2 00101230311200022 33.11.22 15:19 33.11.22
3 00101230311200022 03.11.22 15:19 03.11.22

You can now work on the DATE column, e.g. filter with grepl() or use a date parser. Here lubridate can be your friend.

df %>% 
  mutate(
     DATE    = str_split(V1, pattern = " ") %>% sapply("[",2)
   , IS_DATE = lubridate::dmy(DATE)
) 

# A tibble: 3 × 3
  V1                               DATE     IS_DATE   
  <chr>                            <chr>    <date>    
1 00101230311200022 73.11.22 15:19 73.11.22 NA        
2 00101230311200022 33.11.22 15:19 33.11.22 NA        
3 00101230311200022 03.11.22 15:19 03.11.22 2022-11-03
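
Splitting on spaces only works when the date is always the second field. As a sketch of the "more explicit pattern" route mentioned above, the date-like token can be pulled out with base R's regexpr() and regmatches(), wherever it sits in the string. The input vector below is made up for illustration:

```r
x <- c("00101230311200022 73.11.22 15:19",
       "no date here",
       "00101230311200022 03.11.2022 15:19")

# dd.mm.yyyy or dd.mm.yy as a standalone token; \\b keeps it from
# matching inside a longer run of digits
pattern <- "\\b\\d{2}\\.\\d{2}\\.(\\d{4}|\\d{2})\\b"

m <- regexpr(pattern, x, perl = TRUE)

# regmatches() returns only the hits, so fill them into an NA vector
found <- rep(NA_character_, length(x))
found[m != -1] <- regmatches(x, m)

print(found)
```

The extracted tokens are still only date-shaped (73.11.22 passes), so they need to go through a parser afterwards to check that they are real dates.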