I'm currently cleaning a dataset with a column of strings containing delimiters of various types, like # ? | ; /
I'm looking to capture any date in the string that is not of the format 2022-09-19T07:20:00
(The years in my dataset are usually of the format 20XX, like 2022 or 2023)
How do I capture these outliers without writing a complex regex?
Here's an example of an outlier: 5002522-03-04T01:03:00
Here's a sample string:
0/0/Just/Some/2022-07-06T17:05:00/2022-07-06T19:25:00/Sample/6780/Data/in///my_Dataset
Please Advise.
CodePudding user response:
This should match the outliers, based on the example provided:
[\/#?:|][^\/#?:|]+\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[\/#?:|]
https://regex101.com/r/6R4YMQ/1
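The normal timestamp is described by \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}
A well-formed timestamp should sit directly between two delimiters, each matched by [\/#?:|]
So if one or more non-delimiter characters, matched by [^\/#?:|], come between the opening delimiter and the timestamp,
it's a match, and the value is an outlier.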
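For anyone who wants to try this outside regex101, here is a minimal Python sketch of the same idea. The names OUTLIER and find_outliers are made up for illustration, and the second test string is just the sample from the question with the outlier spliced in. Note that the character class lists : where the question lists ; so it may need adjusting to the delimiters actually present in the data.

```python
import re

# Same pattern as above: delimiter, one or more non-delimiter characters,
# a normal-looking timestamp, then a closing delimiter.
OUTLIER = re.compile(
    r"[\/#?:|][^\/#?:|]+\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[\/#?:|]"
)

def find_outliers(text):
    """Return the malformed timestamps found in text (hypothetical helper)."""
    # Each match includes the delimiter on both sides, so slice them off.
    return [m.group(0)[1:-1] for m in OUTLIER.finditer(text)]

sample = "0/0/Just/Some/2022-07-06T17:05:00/2022-07-06T19:25:00/Sample/6780/Data/in///my_Dataset"
with_outlier = "0/0/Just/Some/5002522-03-04T01:03:00/2022-07-06T19:25:00/Sample"

print(find_outliers(sample))        # no outliers in the clean sample
print(find_outliers(with_outlier))  # the over-long year is reported
```

Because the pattern requires a delimiter on both sides, an outlier at the very start or end of a string would not be reported, and two outliers separated by a single delimiter would only yield the first one, since that delimiter is consumed by the first match.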