I would like to implement a regular expression in bash that allows me to verify a series of characteristics on a dataset. A sample is attached below:
id, date of birth, grade, explusion, serious misdemeanor
123,2005-01-01,5.36,1,1
582,1999-05-12,8.51,0,1
9274,2001-25-12,9.65,0,0
21,2006-14-05,0.53,4,1
The id is required to have only 3 digits, the year of birth must be before 2000, the grade point average must be at least 5.60 with the second decimal place being other than 0, and there must be at least one expulsion or serious misdemeanor.
The result of executing the regular expression should be:
582,1999-05-12,8.51,0,1
I have tried to implement the following regular expression and it does not give me any result.
grep -E "^\d{0,3},[0-2][0-9][0-9][0-9].*,[1-5].[0-5][1-9],[1-9],[1-9]$"
Any idea?
CodePudding user response:
If it is mandatory to use grep, would you please try:
grep -E '^[0-9]{1,3},1[0-9]{3}(-[0-9]{2}){2},(5\.[6-9][1-9]|[6-9]\.[0-9][1-9]|[1-9][0-9]+\.[0-9][1-9]),([1-9][0-9]*,[0-9]+|[0-9]+,[1-9][0-9]*)$' input_file
Result:
582,1999-05-12,8.51,0,1
[0-9]{1,3} matches if id has 1-3 digits. (I have interpreted "only 3 digits" like that. If it means something different, tweak the regex accordingly.)
1[0-9]{3}(-[0-9]{2}){2} matches if the birth year is before 2000 (exclusive).
(5\.[6-9][1-9]|[6-9]\.[0-9][1-9]|[1-9][0-9]+\.[0-9][1-9]) matches if grade is greater than 5.60 with the second decimal place being other than 0.
([1-9][0-9]*,[0-9]+|[0-9]+,[1-9][0-9]*) matches if either or both of explusion and serious misdemeanor have a non-zero value.
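To make the pieces easier to adjust one at a time, here is a minimal sketch that assembles the same pattern from separate variables and runs it against the sample from the question. The variable names (id, dob, grade, flags) and the temporary file are only illustrative assumptions, not part of the original command.

```shell
#!/usr/bin/env bash
# Recreate the sample data from the question in a temporary file (illustrative only).
input_file=$(mktemp)
cat > "$input_file" <<'EOF'
id, date of birth, grade, explusion, serious misdemeanor
123,2005-01-01,5.36,1,1
582,1999-05-12,8.51,0,1
9274,2001-25-12,9.65,0,0
21,2006-14-05,0.53,4,1
EOF

# One sub-pattern per column; the names are hypothetical helpers.
id='[0-9]{1,3}'                                   # 1 to 3 digits
dob='1[0-9]{3}(-[0-9]{2}){2}'                     # year 1000-1999, then two 2-digit parts
grade='(5\.[6-9][1-9]|[6-9]\.[0-9][1-9]|[1-9][0-9]+\.[0-9][1-9])'
flags='([1-9][0-9]*,[0-9]+|[0-9]+,[1-9][0-9]*)'   # at least one of the last two is non-zero

# \$ keeps a literal end-of-line anchor inside the double-quoted string.
matches=$(grep -E "^${id},${dob},${grade},${flags}\$" "$input_file")
printf '%s\n' "$matches"

rm -f "$input_file"
```

With the sample above this should print only the 582 line, the same as the single-line command.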
CodePudding user response:
Regular expressions do not understand numeric values, and they certainly do not understand boolean logic. All they know is text. You'll need to use an actual programming language like Awk or Perl to do this.
Here's an example:
$ perl -l -a -F, -E'say if length($F[0])>3 || $F[2] < 5.60' foo.txt
123,2005-01-01,5.36,1,1
9274,2001-25-12,9.65,0,0
21,2006-14-05,0.53,4,1
This call to perl splits apart the fields on commas, and then prints the line if the length of the first column is over 3, or the value of the third column is less than 5.60.
This is just a starting point, but this is the direction to go.
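Going a step further in that direction, a rough Awk sketch of the complete filter might look like the following. It assumes that "only 3 digits" means 1 to 3 digits, that the first line is a header, and that the year is the first four characters of the date column. The function name filter_students and the temporary sample file are hypothetical.

```shell
#!/usr/bin/env bash
# Print the rows that satisfy every rule from the question.
# filter_students is a hypothetical helper name; $1 is the CSV file to read.
filter_students() {
  awk -F, '
    NR == 1                            { next }  # skip the header line
    $1 !~ /^[0-9][0-9]?[0-9]?$/        { next }  # id must be 1 to 3 digits
    substr($2, 1, 4) + 0 >= 2000       { next }  # birth year must be before 2000
    $3 + 0 < 5.60                      { next }  # grade compared as a number
    $3 !~ /\.[0-9][1-9]$/              { next }  # second decimal must not be 0
    $4 + 0 == 0 && $5 + 0 == 0         { next }  # need an expulsion or a misdemeanor
                                       { print }
  ' "$1"
}

# Try it on the sample data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
id, date of birth, grade, explusion, serious misdemeanor
123,2005-01-01,5.36,1,1
582,1999-05-12,8.51,0,1
9274,2001-25-12,9.65,0,0
21,2006-14-05,0.53,4,1
EOF

result=$(filter_students "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Each rule is a separate numeric or text test, so changing a threshold means editing one line rather than rewriting a regex alternation.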