How can I extract dates using grep? I need to match " , ", " / ", " . " ...

Time:10-21

targets: "2019,3,1", "2019,03,01", "2019.03.01", "2019-03-01", " '21/3/1"

year<-c("2019,3,1", "2019,03,01", "2019.03.01", "2019-03-01", " '21/3/1", "2019,3-1", "2019-03=01", "2019,03.01", "2019/03-01", "2019-350-01")

grep("",year,value=T)

I tried

grep("[20 ']19([,./-]0?[3])[,./-](0?[1])$",year,value=T)

but it still returns "2019,3-1", "2019,03.01" and "2019/03-01", which mix two different separators.

CodePudding user response:

You can try this:

year<-c("2019,3,1", "2019,03,01", "2019.03.01", "2019-03-01", " '21/3/1", "2019,3-1", "2019-03=01", "2019,03.01", "2019/03-01", "2019-350-01")

grep("\\d{2,4}([,./-])\\d{1,2}\\1{1}\\d{1,2}",year,value=T)

Detail:

  • \\d{2,4}: two to four digits, for the year.
  • ([,./-]): one separator character (comma, dot, slash, or hyphen), captured as group 1.
  • \\d{1,2}: one or two digits, for the month.
  • \\1{1}: a backreference to whatever group 1 captured, so the second separator must be the same as the first. The {1} means "exactly once" and can be omitted.
  • \\d{1,2}: one or two digits, for the day.
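As a rough check of the breakdown above, the sketch below runs the same pattern (with the redundant {1} dropped) against the question's vector and lists what is kept and what is rejected. The variable names pattern, matched, and rejected are only illustrative.

```r
year <- c("2019,3,1", "2019,03,01", "2019.03.01", "2019-03-01", " '21/3/1",
          "2019,3-1", "2019-03=01", "2019,03.01", "2019/03-01", "2019-350-01")

# Group 1 captures the first separator; \\1 requires the second one to be identical.
pattern <- "\\d{2,4}([,./-])\\d{1,2}\\1\\d{1,2}"

# Strings whose two separators agree are kept.
matched <- grep(pattern, year, value = TRUE)
print(matched)

# Everything else is rejected: mixed separators, "=" as a separator,
# or a three-digit month that leaves no room for the second separator.
rejected <- setdiff(year, matched)
print(rejected)
```

Note that the pattern is not anchored, so it matches a date-like substring anywhere in the string. If stricter matching is wanted, adding ^ and $ (and allowing for the leading " '" in the last target) would be one option.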


I usually use regex101 for visualization, but it does not offer an R flavor. Converting a Python regex to an R regex needs only a small change: backslashes must be doubled inside an R string, so \d in Python becomes \\d in R.
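The backslash doubling can be seen in a small sketch like the one below. The vector x is made up for illustration, and the raw-string form assumes R 4.0.0 or later.

```r
# Hypothetical sample input, not from the question.
x <- c("a1", "bb", "c22")

# In a regular R string literal the backslash must itself be escaped,
# so the regex \d is written "\\d".
escaped <- grepl("\\d", x)
print(escaped)

# R 4.0.0 and later also accept raw strings, which take the pattern verbatim.
raw <- grepl(r"(\d)", x)
print(raw)

# The escaped literal is still only two characters: a backslash and a "d".
print(nchar("\\d"))
```

With a raw string, a pattern tested on regex101 can usually be pasted in unchanged, as long as it only uses syntax that R's regex engine supports.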

Hope this is useful.

CodePudding user response:

Unless you really need a regular expression solution, you could use the ymd() function from the lubridate package.

library(lubridate)
ymd(year)

Its output:

 [1] "2019-03-01" "2019-03-01" "2019-03-01" "2019-03-01" "2021-03-01"
 [6] "2019-03-01" "2019-03-01" "2019-03-01" "2019-03-01" NA          
Warning message:
 1 failed to parse. 

The one that failed to parse is "2019-350-01", which clearly can't be directly interpreted as a date.

CodePudding user response:

As others noted, it depends on how strict you want to be about what counts as a date. If you are willing to treat any non-digit character between numbers as a year/month/day separator, you can normalize the separators with a regex and then parse:

as.Date(gsub("[^0-9]", "/", year), format = "%Y/%m/%d")

This replaces every non-digit character with /, so it returns NA for the entry that starts with a space and ' (which becomes "//21/3/1") and for the one with month 350.
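To see where the two NA values come from, it may help to look at the intermediate strings before they are parsed. This is only a sketch of the same approach split into two steps; the names normalized and parsed are illustrative.

```r
year <- c("2019,3,1", "2019,03,01", "2019.03.01", "2019-03-01", " '21/3/1",
          "2019,3-1", "2019-03=01", "2019,03.01", "2019/03-01", "2019-350-01")

# Step 1: replace every non-digit character with "/".
# The leading space and apostrophe in " '21/3/1" each become a slash.
normalized <- gsub("[^0-9]", "/", year)
print(normalized)

# Step 2: parse with a fixed year/month/day format.
# Entries that do not fit the format come back as NA.
parsed <- as.Date(normalized, format = "%Y/%m/%d")
print(parsed)

# Which original strings failed to parse?
print(year[is.na(parsed)])
```

Because every separator is normalized to the same character, this approach also accepts mixed-separator strings such as "2019,3-1", which the asker wanted to exclude.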
