How do I validate a date column in a file using a regular expression in awk? My code doesn't seem to work.
My code:
awk -F '|' BEGIN {OFS=FS}
{ if
($1 ~ /^\d{1,2}\/\d{1,2}\/\d{4} \d{1,2}.\d{1,2}.\d{1,2} [AP]M\z/)
print
}' file > file.out
File contents:
04/21/2014 02:04:55 AM|34536
12/31/2021 03:29:15 AM|87612
06-JUN-2022|09876
2022-JAN-2011 22:12:33|23120
Expected output:
04/21/2014 02:04:55 AM|34536
12/31/2021 03:29:15 AM|87612
CodePudding user response:
In GNU awk the \d and \z sequences are not valid regex operators (a quick web search doesn't show them as valid regex operators in a couple of other flavors of awk either, though that was by no means an exhaustive search).
I'd suggest replacing the \d with [0-9] or [[:digit:]]; as for the \z, you could try \> or \y.
One other issue is the use of . as a wildcard match in the time component; if you know all times will use a colon (:) as a delimiter then I'd use an explicit colon.
Rolling these changes into the current code (and fixing a couple cut-n-paste/syntax issues):
awk -F '|' 'BEGIN {OFS=FS}
{ if ($1 ~ /^[0-9]{1,2}\/[0-9]{1,2}\/[0-9]{4} [0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2} [AP]M\y/)
    print
}' file
This generates:
04/21/2014 02:04:55 AM|34536
12/31/2021 03:29:15 AM|87612
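For awk builds other than GNU awk, here is a sketch of the same filter that sticks to basic regex syntax: since the match is against $1 (the text before the first pipe), a plain $ can anchor the end of the field, and the {1,2} / {4} interval expressions (which not every awk build supports) are spelled out by hand. The file names dates_sample.txt and dates_valid.txt are assumptions for illustration only.

```shell
# Sample input copied from the question (file name is an assumption)
cat > dates_sample.txt <<'EOF'
04/21/2014 02:04:55 AM|34536
12/31/2021 03:29:15 AM|87612
06-JUN-2022|09876
2022-JAN-2011 22:12:33|23120
EOF

# $1 ends right after AM/PM, so "$" anchors the end of the date field.
# [0-9][0-9]? stands in for [0-9]{1,2}; four [0-9] stand in for [0-9]{4}.
awk -F '|' '$1 ~ /^[0-9][0-9]?\/[0-9][0-9]?\/[0-9][0-9][0-9][0-9] [0-9][0-9]?:[0-9][0-9]?:[0-9][0-9]? [AP]M$/' dates_sample.txt > dates_valid.txt

cat dates_valid.txt
```

A pattern with no action prints the matching line, so the explicit if/print is not needed here.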
NOTES:
- obviously (?) this code assumes a specific date/time format and thus ...
- this code will not match on other valid date/time formats (eg, won't match on 2021/12/31)
- the use of [0-9] opens you up to matching on strings that are not valid dates and/or times (eg, this code will match on 99/99/2022 and 99:99:99); OP can address some of these by limiting the series of digits that can be matched in a given position (eg, [0-2][0-9] for hours) but even this is problematic since 29 will match but is not a valid hour
- as alluded to in comments ... validating dates/times is doable but will require a good bit more code (alternatively run a web search on bash awk validate dates times for additional ideas)
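As a rough sketch of what "a good bit more code" could look like, the snippet below first checks the shape of $1 and then range-checks the individual numbers. It assumes the MM/DD/YYYY HH:MM:SS AM/PM layout with a 12-hour clock; the file names and the 99/99/2022 row are made up for illustration. It still does not check month lengths or leap years, so something like 02/31/2022 would pass.

```shell
# Sample input: two rows from the question plus a made-up out-of-range row
cat > dates_sample.txt <<'EOF'
04/21/2014 02:04:55 AM|34536
12/31/2021 03:29:15 AM|87612
99/99/2022 99:99:99 AM|11111
06-JUN-2022|09876
EOF

awk -F '|' '
# Shape check first: MM/DD/YYYY HH:MM:SS followed by AM or PM
$1 ~ /^[0-9][0-9]?\/[0-9][0-9]?\/[0-9][0-9][0-9][0-9] [0-9][0-9]?:[0-9][0-9]?:[0-9][0-9]? [AP]M$/ {
    # Split on "/", space and ":" so p[1..6] = month, day, year, hour, minute, second
    split($1, p, "[/ :]")
    # +0 forces a numeric comparison; hours assume a 12-hour clock
    if (p[1]+0 >= 1 && p[1]+0 <= 12 &&
        p[2]+0 >= 1 && p[2]+0 <= 31 &&
        p[4]+0 >= 1 && p[4]+0 <= 12 &&
        p[5]+0 <= 59 && p[6]+0 <= 59)
        print
}' dates_sample.txt > dates_checked.txt

cat dates_checked.txt
```

The regex guarantees that every piece is made of digits only, so the range checks only need upper bounds for minutes and seconds.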
CodePudding user response:
\z asserts position at the end of the string, but in your case the line continues with a pipe symbol and (I assume) a process ID, so the regex does not match your input string.
If you are certain that this is all that the line contains, you can try to match everything after the AM/PM marker:
^\d{1,2}\/\d{1,2}\/\d{4} \d{1,2}.\d{1,2}.\d{1,2} [AP]M.*
Otherwise, match till next distinct delimiter.
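A sketch of the "match till the next delimiter" idea in awk syntax, assuming the regex is tested against the whole line ($0) rather than a single field. Note that with -F '|' and $1, as in the question, the field already stops before the pipe, so this only matters for whole-line matching. The file names are assumptions, and [0-9] is used because awk does not understand \d.

```shell
# Sample input copied from the question (file name is an assumption)
cat > dates_sample.txt <<'EOF'
04/21/2014 02:04:55 AM|34536
12/31/2021 03:29:15 AM|87612
06-JUN-2022|09876
2022-JAN-2011 22:12:33|23120
EOF

# No -F here: the pattern runs against the whole line, and [|] requires
# the pipe delimiter to follow directly after AM/PM.
awk '/^[0-9][0-9]?\/[0-9][0-9]?\/[0-9][0-9][0-9][0-9] [0-9][0-9]?:[0-9][0-9]?:[0-9][0-9]? [AP]M[|]/' dates_sample.txt > dates_line.txt

cat dates_line.txt
```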
CodePudding user response:
awk -F'|' '$1 ~ /^[0-1][0-9]\/[0-3][0-9]\/[0-9]{4}.*[AP]M$/' file
04/21/2014 02:04:55 AM|34536
12/31/2021 03:29:15 AM|87612