Home > front end >  Extract timestamp in file name
Extract timestamp in file name

Time:10-18

I have a vector containing file names:

Filenames = c("blabla_bli_20140524_002532_000.wav", 
"20201025_231205.wav", 
"ble_20190612_220013_012.wav", 
"X-20150312_190225_Blablu.wav", 
"0000125.wav")

Of course the real vector is longer. I need to extract the timestamp from the name so to obtain the following result in date format:

Result = c("2014-05-24 00:25:32.000", 
"2020-10-25 23:12:05.000", 
"2019-06-12 22:00:13.012", 
"2015-03-12 19:02:25.000", 
NA)

Note that sometimes characters are present before or after the timestamp, and that some file names display the miliseconds while some do not (in this case they are assumed to be 000 miliseconds). I also need to obtain something like "NA" if the file name does not have the timestamp in it.

I found what looks like a good solution here but it is in python and I don't know how to translate it in R. I tried this, which is not working:

str_extract(Filenames, regex("_((\d )_(\d ))"))
Error: '\d' is an unrecognized escape in character string starting ""_((\d"

CodePudding user response:

You can use

library(stringr)
rx <- "(\\d{4})(\\d{2})(\\d{2})_(\\d{2})(\\d{2})(\\d{2})(?:_(\\d ))?"
Filenames = c("blabla_bli_20140524_002532_000.wav", "20201025_231205.wav", "ble_20190612_220013_012.wav", "X-20150312_190225_Blablu.wav", "0000125.wav")
m <- str_match(Filenames, rx)
result <- ifelse(is.na(m[,8]), 
    str_c(m[,2], "-", m[,3], "-", m[,4], " ", m[,5], ":", m[,6], ":", m[,7], ".000"),
    str_c(m[,2], "-", m[,3], "-", m[,4], " ", m[,5], ":", m[,6], ":", m[,7], ".", m[,8]))
result

See the R demo. Output:

> result
[1] "2014-05-24 00:25:32.000"
[2] "2020-10-25 23:12:05.000"
[3] "2019-06-12 22:00:13.012"
[4] "2015-03-12 19:02:25.000"
[5] NA  

See the regex demo. It captures parts of the datetime string into separate groups. If the last group ((\d ) in the (?:_(\d ))? optional non-capturing group matching milliseconds) is matched, we add it preceded with a dot, else, we add .000.

CodePudding user response:

You can use gsub to extract your datetime part by pattern, then rely (mostly) on the parsing of datetimes by POSIX standards. The only catch is that in POSIX, milliseconds must be expressed as ss.mmm (fractional seconds), so you need to replace the _ with a .

timestamps <- gsub(".*?(([0-9]{8}_[0-9]{6})(_([0-9]{3}))?).*?$", "\\2.\\4", Filenames)
timestamps
[1] "20140524_002532.000" "20201025_231205."    "20190612_220013.012"
[4] "20150312_190225."    "0000125.wav" 

We have captured the datetime section (\\2), added a dot, then the (optional) millisecond section (\\4), without the underscore. Note that the mismatched filename remains unchanged - that's ok.

Now we specify the datetime format using the POSIX specification to first parse the strings, then print them in a different format:

times <- as.POSIXct(timestamps, format="%Y%m%d_%H%M%OS")
format(times, "%Y-%m-%d %H:%M:%OS3")
[1] "2014-05-24 00:25:32.000" "2020-10-25 23:12:05.000" "2019-06-12 22:00:13.012"
[4] "2015-03-12 19:02:25.000" NA 

Note that the string that was not a timestamp just got turned into NA, so you can easily get rid of it here. Also, if you want to see the milliseconds printed out, you seed to use the %OS (fractional seconds) format, plus the number of digits (0-6) you want printed - in this case, %OS3.

  • Related