DATE validation using regex in awk

How can I validate a date column in a file using a regular expression in awk? My code doesn't seem to work.

my code

awk -F '|' BEGIN {OFS=FS} 
{ if 
($1 ~ /^\d{1,2}\/\d{1,2}\/\d{4} \d{1,2}.\d{1,2}.\d{1,2} [AP]M\z/)
print
}' file > file.out

file contents -

04/21/2014 02:04:55 AM|34536
12/31/2021 03:29:15 AM|87612
06-JUN-2022|09876
2022-JAN-2011 22:12:33|23120

expected output

04/21/2014 02:04:55 AM|34536
12/31/2021 03:29:15 AM|87612

CodePudding user response：

In GNU awk, \d and \z are not valid regex operators (a quick web search suggests they are not valid in a couple of other flavors of awk either, though this was by no means an exhaustive search).

I'd suggest replacing \d with [0-9] or [[:digit:]]; as for \z, you could try \> or \y (both GNU awk extensions), or simply $ since the test is run against $1 only.

One other issue is the use of . as a wildcard match in the time component; if you know all times will use a colon (:) as a delimiter then I'd use an explicit colon.
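To illustrate the point about the wildcard, here is a small sketch (the sample strings and variable names are made up for this example). In a regex, an unescaped . matches any single character, so a time written with arbitrary separators slips through, while an explicit : rejects it. The patterns use [0-9][0-9] rather than {2} only to keep the sketch portable to awk versions without interval expressions.

```shell
# Hypothetical sample input: the second line uses 'x' and 'y' instead of colons.
sample='02:04:55 AM
02x04y55 AM'

# '.' matches any single character, so both lines pass.
loose=$(printf '%s\n' "$sample" | awk '/^[0-9][0-9].[0-9][0-9].[0-9][0-9] [AP]M$/')

# An explicit ':' only accepts the colon-delimited time.
strict=$(printf '%s\n' "$sample" | awk '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9] [AP]M$/')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```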

Rolling these changes into the current code (and fixing a couple cut-n-paste/syntax issues):

awk -F '|' 'BEGIN {OFS=FS}
{ if ($1 ~ /^[0-9]{1,2}\/[0-9]{1,2}\/[0-9]{4} [0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2} [AP]M\y/)
    print
}' file

This generates:

04/21/2014 02:04:55 AM|34536
12/31/2021 03:29:15 AM|87612

NOTES:

obviously (?) this code assumes a specific date/time format and thus will not match other valid date/time formats (eg, it won't match 2021/12/31)
the use of [0-9] opens you up to matching strings that are not valid dates and/or times (eg, this code will match 99/99/2022 and 99:99:99); OP can address some of these by limiting the digits that can be matched in a given position (eg, [0-2][0-9] for hours), but even this is problematic since 29 will match but is not a valid hour
as alluded to in comments ... validating dates/times is doable but will require a good bit more code (alternatively run a web search on bash awk validate dates times for additional ideas)
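As a rough idea of what that extra code might look like, here is a minimal sketch that checks the shape with a regex and then range-checks the numeric parts. The names validate_dates and valid_dt are hypothetical, the MM/DD/YYYY HH:MM:SS AM|PM layout is assumed from the sample data, and the day check is deliberately naive (it accepts 31 for every month and knows nothing about leap years).

```shell
# validate_dates and valid_dt are hypothetical names used for illustration only.
# Assumes field 1 looks like: MM/DD/YYYY HH:MM:SS AM|PM (12-hour clock).
validate_dates() {
  awk -F '|' '
    function valid_dt(s,    p, n) {
      # Shape check first: digits, slashes, colons and AM/PM in the expected places.
      if (s !~ /^[0-9][0-9]?\/[0-9][0-9]?\/[0-9][0-9][0-9][0-9] [0-9][0-9]?:[0-9][0-9]:[0-9][0-9] [AP]M$/)
        return 0
      # Split on "/", " " and ":" into month, day, year, hour, minute, second, AM/PM.
      n = split(s, p, "[/ :]")
      if (n != 7)
        return 0
      # Range checks; adding 0 forces a numeric comparison.
      return (p[1]+0 >= 1 && p[1]+0 <= 12 &&
              p[2]+0 >= 1 && p[2]+0 <= 31 &&
              p[4]+0 >= 1 && p[4]+0 <= 12 &&
              p[5]+0 <= 59 && p[6]+0 <= 59)
    }
    # Print only the lines whose first field passes the check.
    valid_dt($1)
  ' "$@"
}
```

It could be called as validate_dates file > file.out; a line such as 99/99/2022 99:99:99 AM|1 would pass the shape check but fail the range checks.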

CodePudding user response：

\z asserts position at the end of the string, but in your case the line continues with a pipe symbol and (I assume) a process ID, which means there's no match between your input string and the regex when it is tested against the whole line. (With -F '|' and a test against $1, the pipe is not part of the field.)

If you are certain that this is all that the line contains, you can try matching everything after AM/PM (in awk, write [0-9] instead of \d): ^\d{1,2}\/\d{1,2}\/\d{4} \d{1,2}.\d{1,2}.\d{1,2} [AP]M.*

Otherwise, match till next distinct delimiter.
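One way to read "match till the next delimiter" is sketched below, assuming the regex is applied to the whole line and that a | always follows the timestamp directly. The function name match_to_pipe is hypothetical, and [0-9] stands in for \d since awk does not support the latter.

```shell
# match_to_pipe is a hypothetical name; the pattern is tested against the whole line.
# Instead of an end-of-string anchor, the regex ends at the '|' delimiter.
match_to_pipe() {
  awk '/^[0-9][0-9]?\/[0-9][0-9]?\/[0-9][0-9][0-9][0-9] [0-9][0-9]?:[0-9][0-9]?:[0-9][0-9]? [AP]M[|]/' "$@"
}
```

Writing the pipe as [|] avoids any doubt about how a particular awk treats \| in a regex.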

CodePudding user response：

awk -F'|' '$1 ~ /^[0-1][0-9]\/[0-3][0-9]\/[0-9]{4}.*[AP]M$/' file

04/21/2014 02:04:55 AM|34536
12/31/2021 03:29:15 AM|87612