Home > Software design >  Filter file on dates that could be anywhere in a particular column
Filter file on dates that could be anywhere in a particular column

Time:12-23

Let's say I have a file with two columns:

blahblah2020-02-03_moreblah | VALUE |
blah2021-03-04blah | VALUE |

Using awk I need to select just those rows where the date in the first column is less than some other date I have. The annoying thing is the date could be among any weird strings on either side, or none at all - but it will be of the format YYYY-mm-dd. I'm not sure how I ended up in a situation where I have to use awk to this but here I am and I'm very thankful in advance!

CodePudding user response:

Assumptions:

  • dates will always be of the format YYYY-MM-DD (confirmed in OP's description)
  • any dates of interest will only reside in the 1st |-delimited field
  • the 1st field will only contain at most one date string (ie, don't have to worry about the 1st field containing more than one date string)

Using GNU awk 4.0 (or newer) for FPAT support:

awk -v testdt="${dt}" '                                        # pass bash variable "dt" in as awk variable "testdt"
BEGIN { FPAT="[12][0-9]{3}-[012][0-9]-[0123][0-9]"             # define pattern we are looking for; if exists it should be field #1
#       FPAT="[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}"    # one of a few alternatives
      }

$1 < testdt                                                    # if we have a match for FPAT and less than testdt then echo entire line to stdout
' input.dat

NOTE: if the input could have data of the format ####-##-## that aren't valid dates then OP may need to tweak the FPAT defintion and/or add more logic to validate a match as an actual date before running the test ($1 < testdt)

Using OP's 2-line sample input here are some results using different values for the (bash) variable dt:

$ dt='2019-06-01'
$ awk -v testdt="${dt}" 'BEGIN {FPAT="[12][0-9]{3}-[012][0-9]-[0123][0-9]"} $1 < testdt' input.dat
       -- no output --

$ dt='2020-06-01'
$ awk -v testdt="${dt}" 'BEGIN {FPAT="[12][0-9]{3}-[012][0-9]-[0123][0-9]"} $1 < testdt' input.dat
blahblah2020-02-03_moreblah | VALUE |

$ dt='2021-06-01'
$ awk -v testdt="${dt}" 'BEGIN {FPAT="[12][0-9]{3}-[012][0-9]-[0123][0-9]"} $1 < testdt' input.dat
blahblah2020-02-03_moreblah | VALUE |
blah2021-03-04blah | VALUE |

CodePudding user response:

\d\d\d\d-\d\d-\d\d https://regexone.com/ it works but there are better solutions if u want it less than write a script usually in pyhton with this regex syntax by collecting all dates and then filtering based on position before - whether thats greater than or less than the date u have. for i in range date: if date[i] < regex

CodePudding user response:

Using any awk, in any shell, on every Unix box:

$ awk -v tgt='2020-05-01' 'match($0,/[0-9]{4}(-[0-9]{2}){2}/) && (substr($0,RSTART,RLENGTH) < tgt)' file
blahblah2020-02-03_moreblah | VALUE |

$ awk -v tgt='2021-05-01' 'match($0,/[0-9]{4}(-[0-9]{2}){2}/) && (substr($0,RSTART,RLENGTH) < tgt)' file
blahblah2020-02-03_moreblah | VALUE |
blah2021-03-04blah | VALUE |
  • Related