BASH-Grep results based on OR logic CSV file

Time:04-17

I have a CSV file containing athlete records: their personal info and medal counts.

Using a single egrep (extended regular expression), I need to select the rows that meet all of the following (I have almost everything):

  • The ID must have 9 digits, and the third digit must be either 0 or 3.
  • The birth year must be earlier than 2000, and the month must be October (10).
  • The athlete's height must be greater than or equal to 1.7 (this is where I'm struggling). The second decimal cannot be 0.
  • The athlete must have won at least one medal (gold or silver, no matter how many, but at least one), and no bronze.
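As a rough sketch, the first two requirements can be checked in isolation before they are combined into one expression. The sample values below are hypothetical, grep -E is assumed to be equivalent to egrep, and "earlier than 2000" is assumed to mean a four-digit year starting with 1:

```shell
# Hypothetical sample IDs: only 9-digit IDs whose third digit is 0 or 3 should pass.
id_hits=$(printf '%s\n' 353946547 820456660 999946547 12345678 |
  grep -E '^[0-9]{2}[03][0-9]{6}$')
echo "$id_hits"

# Hypothetical sample birthdays: year 1000-1999 (assumption) and month 10.
date_hits=$(printf '%s\n' 1994-10-01 1986-08-13 2001-10-05 |
  grep -E '^1[0-9]{3}-10-[0-9]{2}$')
echo "$date_hits"
```

Note that [03] is a character class for "0 or 3"; no | is needed inside the brackets.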

So far I have everything, but the height part needs one last change to always work: I don't know how to say "1 meter followed by 7-9" while also accepting "2 meters followed by 0-9". For the medals, I don't know how to express that if gold is greater than 0 then silver can be 0, and the other way around.
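Both stuck points are "either this or that" conditions, which an extended regular expression writes as a grouped alternation (a|b). A minimal sketch on isolated fields, assuming heights always have exactly two decimals, medal counts have no leading zeros, and "no bronze" means a bronze count of exactly 0 (the sample values are hypothetical):

```shell
# Height >= 1.7 with a non-zero second decimal:
# either 1.[7-9] or 2.[0-9], followed by a digit from 1 to 9.
height_hits=$(printf '%s\n' 1.65 1.70 1.78 1.92 2.05 2.10 |
  grep -E '^(1\.[7-9]|2\.[0-9])[1-9]$')
echo "$height_hits"

# Medals as "gold,silver,bronze," with at least one gold or silver and bronze 0:
# either gold is non-zero (silver anything) or silver is non-zero (gold anything).
medal_hits=$(printf '%s\n' '0,1,0,' '1,0,0,' '0,0,3,' '0,0,0,' '2,1,0,' '1,0,2,' |
  grep -E '^([1-9][0-9]*,[0-9]+|[0-9]+,[1-9][0-9]*),0,$')
echo "$medal_hits"
```

The ^ and $ anchors are only there because each value is tested on its own line; inside the full expression the surrounding commas do that job.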

\d\d[0|3]\d\d\d\d\d\d,.*[1]\d\d\d[-][1][0][-]\d\d,[1|2].[7-9][^0],\d\d,.*[0-9],[1-9],[0].*

This returns:

353946547,Arthur van Doren,BEL,male,1994-10-01,1.78,74,hockey,0,1,0,
820456660,Giulia Emmolo,ITA,female,1991-10-16,1.71,67,aquatics,0,1,0,
230772998,Kelly Brazier,NZL,female,1989-10-28,1.71,70,rugby sevens,0,1,0,
713017392,Pavlo Tymoshchenko,UKR,male,1986-10-13,1.92,78,modern pentathlon,0,1,0,

But it should return this (for the demo I have basically moved the 1 from the silver position to the gold position):

353946547,Arthur van Doren,BEL,male,1994-10-01,1.78,74,hockey,0,1,0,
820456660,Giulia Emmolo,ITA,female,1991-10-16,1.71,67,aquatics,0,1,0,
230772998,Kelly Brazier,NZL,female,1989-10-28,1.71,70,rugby sevens,0,1,0,
713017392,Pavlo Tymoshchenko,UKR,male,1986-10-13,1.92,78,modern pentathlon,0,1,0,
110156979,Lauritz Schoof,GER,male,1990-10-07,1.95,98,rowing,1,0,0,
730877927,Matthew Centrowitz,USA,male,1989-10-18,1.76,65,athletics,1,0,0,

The file is stored here:

https://github.com/jpiedehierroa/files/blob/main/athletesv2.txt

You can use this site to debug the regex against the file more quickly:

https://regex101.com/

Many thanks,

CodePudding user response:

I think this regex does what you're asking:

\d\d[03]\d\d\d\d\d\d,.*[1]\d\d\d[-][1][0][-]\d\d,(1\.[7-9]|2\.[0-9])[^0],\d\d,.*,([1-9]\d*,\d+|\d+,[1-9]\d*),0,$

CodePudding user response:

Sample input:

$ cat medals.dat
353946547,Arthur van Doren,BEL,male,1994-10-01,1.78,74,hockey,0,1,0,
820456660,Giulia Emmolo,ITA,female,1991-10-16,1.71,67,aquatics,0,1,0,
230772998,Kelly Brazier,NZL,female,1989-10-28,1.71,70,rugby sevens,0,1,0,
713017392,Pavlo Tymoshchenko,UKR,male,1986-10-13,1.92,78,modern pentathlon,0,1,0,
110156979,Lauritz Schoof,GER,male,1990-10-07,1.95,98,rowing,1,0,0,
730877927,Matthew Centrowitz,USA,male,1989-10-18,1.76,65,athletics,1,0,0,

999946547,Arthur van Doren,BEL,male,1994-10-01,1.78,74,hockey,0,1,0,
999956660,Giulia Emmolo,ITA,female,1991-10-16,1.71,67,aquatics,0,1,0,
999972998,Kelly Brazier,NZL,female,1989-10-28,1.71,70,rugby sevens,0,1,0,
713017392,Pavlo Tymoshchenko,UKR,male,1986-08-13,1.92,78,modern pentathlon,0,1,0,
110156979,Lauritz Schoof,GER,male,1990-10-07,1.65,98,rowing,1,0,0,
730877927,Matthew Centrowitz,USA,male,1989-10-18,1.76,65,athletics,0,0,3,

NOTE: the first 6 lines are from OP's expected output; the last 6 lines are modified copies of the same lines and should not show up in the output

One egrep/regex idea:

$ egrep '^[0-9]{2}[03][0-9]{6},([^,]*,){3}1...-10[^,]*,(1\.[7-9]|2\.[0-9])[0-9]*,([^,]*,){2}([^0]|[^,]*,[^0])' medals.dat
353946547,Arthur van Doren,BEL,male,1994-10-01,1.78,74,hockey,0,1,0,
820456660,Giulia Emmolo,ITA,female,1991-10-16,1.71,67,aquatics,0,1,0,
230772998,Kelly Brazier,NZL,female,1989-10-28,1.71,70,rugby sevens,0,1,0,
713017392,Pavlo Tymoshchenko,UKR,male,1986-10-13,1.92,78,modern pentathlon,0,1,0,
110156979,Lauritz Schoof,GER,male,1990-10-07,1.95,98,rowing,1,0,0,
730877927,Matthew Centrowitz,USA,male,1989-10-18,1.76,65,athletics,1,0,0,

NOTES:

  • my version of egrep doesn't appear to support \d hence the use of [0-9]
  • tallest man to ever live (so far) was 2.72m so we should be good with 2\.[0-9] (ie, no need for [23]\.[0-9])
  • assumes none of the fields of interest have leading white space
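If the "second decimal cannot be 0" and "no bronze" rules also need to be enforced, a stricter variant could look like the sketch below. It is not tested against the full file; it assumes the field order id,name,country,sex,birthday,height,weight,sport,gold,silver,bronze with a trailing comma, heights with exactly two decimals, and that "no bronze" means a bronze count of exactly 0. grep -E is used as the equivalent of egrep. The first two input rows come from the sample data; the last three are hypothetical edits that should be rejected:

```shell
# Assumed field order: id,name,country,sex,birthday,height,weight,sport,gold,silver,bronze,
pattern='^[0-9]{2}[03][0-9]{6},([^,]*,){3}1[0-9]{3}-10-[0-9]{2},(1\.[7-9]|2\.[0-9])[1-9],([^,]*,){2}([1-9][0-9]*,[0-9]+|[0-9]+,[1-9][0-9]*),0,$'

# Rows 3-5 are hypothetical: second decimal 0, a bronze medal, and no medals at all.
matches=$(grep -E "$pattern" <<'EOF'
353946547,Arthur van Doren,BEL,male,1994-10-01,1.78,74,hockey,0,1,0,
110156979,Lauritz Schoof,GER,male,1990-10-07,1.95,98,rowing,1,0,0,
110156979,Lauritz Schoof,GER,male,1990-10-07,1.90,98,rowing,1,0,0,
730877927,Matthew Centrowitz,USA,male,1989-10-18,1.76,65,athletics,1,0,2,
730877927,Matthew Centrowitz,USA,male,1989-10-18,1.76,65,athletics,0,0,0,
EOF
)
printf '%s\n' "$matches"
```

Only the first two rows should be printed. The trailing $ assumes Unix line endings; a file with Windows line endings would need the carriage returns stripped first.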