I have this CSV file which basically is the records from athletes, and their personal info/medals.
I need to get with only one egrep (extended regular expression) the following (I have almost everything):
- ID has to have 9 digits and the third has to be either 0 or 3.
- The birthday year has to be lower than 2000 and the month only october (10).
- The height of the athlete has to be equal or greater than 1,7 (I'm struggling here). The second decimal cannot be 0.
- It has to have won at least a medal (either gold or silver, no matter how many, but at least one), but cannot be bronze.
So far I have everything but the height thing needs some last minute change to be always true (because I don't know how to say that can be 1 meter and between 7-9 but at the same time, accept 2 meters and between 0-9). The medals, I don't know how to tell the system that if gold is greater than 0 silver can be 0 and the other way around...
\d\d[0|3]\d\d\d\d\d\d,.*[1]\d\d\d[-][1][0][-]\d\d,[1|2].[7-9][^0],\d\d,.*[0-9],[1-9],[0].*
Which returns me this:
353946547,Arthur van Doren,BEL,male,1994-10-01,1.78,74,hockey,0,1,0,
820456660,Giulia Emmolo,ITA,female,1991-10-16,1.71,67,aquatics,0,1,0,
230772998,Kelly Brazier,NZL,female,1989-10-28,1.71,70,rugby sevens,0,1,0,
713017392,Pavlo Tymoshchenko,UKR,male,1986-10-13,1.92,78,modern pentathlon,0,1,0,
But it should return this (I have basically alterned the 1 from silver to gold position for demo):
353946547,Arthur van Doren,BEL,male,1994-10-01,1.78,74,hockey,0,1,0,
820456660,Giulia Emmolo,ITA,female,1991-10-16,1.71,67,aquatics,0,1,0,
230772998,Kelly Brazier,NZL,female,1989-10-28,1.71,70,rugby sevens,0,1,0,
713017392,Pavlo Tymoshchenko,UKR,male,1986-10-13,1.92,78,modern pentathlon,0,1,0,
110156979,Lauritz Schoof,GER,male,1990-10-07,1.95,98,rowing,1,0,0,
730877927,Matthew Centrowitz,USA,male,1989-10-18,1.76,65,athletics,1,0,0,
The file is stored here:
https://github.com/jpiedehierroa/files/blob/main/athletesv2.txt
You can use this site to debug quicker the code and the file:
Many thanks,
CodePudding user response:
I think this regex does what you're asking:
\d\d[0|3]\d\d\d\d\d\d,.*[1]\d\d\d[-][1][0][-]\d\d,(1\.[7-9]|2\.[0-9])[^0],\d\d,.*(1,1|0,1|1,0),[0-9],$
CodePudding user response:
Sample input:
$ cat medals.dat
353946547,Arthur van Doren,BEL,male,1994-10-01,1.78,74,hockey,0,1,0,
820456660,Giulia Emmolo,ITA,female,1991-10-16,1.71,67,aquatics,0,1,0,
230772998,Kelly Brazier,NZL,female,1989-10-28,1.71,70,rugby sevens,0,1,0,
713017392,Pavlo Tymoshchenko,UKR,male,1986-10-13,1.92,78,modern pentathlon,0,1,0,
110156979,Lauritz Schoof,GER,male,1990-10-07,1.95,98,rowing,1,0,0,
730877927,Matthew Centrowitz,USA,male,1989-10-18,1.76,65,athletics,1,0,0,
999946547,Arthur van Doren,BEL,male,1994-10-01,1.78,74,hockey,0,1,0,
999956660,Giulia Emmolo,ITA,female,1991-10-16,1.71,67,aquatics,0,1,0,
999972998,Kelly Brazier,NZL,female,1989-10-28,1.71,70,rugby sevens,0,1,0,
713017392,Pavlo Tymoshchenko,UKR,male,1986-08-13,1.92,78,modern pentathlon,0,1,0,
110156979,Lauritz Schoof,GER,male,1990-10-07,1.65,98,rowing,1,0,0,
730877927,Matthew Centrowitz,USA,male,1989-10-18,1.76,65,athletics,0,0,3,
NOTE: 1st 6 lines are from OP's expected output; last 6 lines are modified copies of the same lines; the last 6 lines should not show up in the output
One egrep/regex
idea:
$ egrep '^[0-9]{2}[03][0-9]{6},([^,]*,){3}1...-10[^,]*,(1\.[7-9]|2\.[0-9])[0-9]*,([^,]*,){2}([^0]|[^,]*,[^0])' medals.dat
353946547,Arthur van Doren,BEL,male,1994-10-01,1.78,74,hockey,0,1,0,
820456660,Giulia Emmolo,ITA,female,1991-10-16,1.71,67,aquatics,0,1,0,
230772998,Kelly Brazier,NZL,female,1989-10-28,1.71,70,rugby sevens,0,1,0,
713017392,Pavlo Tymoshchenko,UKR,male,1986-10-13,1.92,78,modern pentathlon,0,1,0,
110156979,Lauritz Schoof,GER,male,1990-10-07,1.95,98,rowing,1,0,0,
730877927,Matthew Centrowitz,USA,male,1989-10-18,1.76,65,athletics,1,0,0,
NOTES:
- my version of
egrep
doesn't appear to support\d
hence the use of[0-9]
- tallest man to ever live (so far) was 2.72m so we should be good with
2\.[0-9]
(ie, no need for[23]\.[0-9]
) - assumes none of the fields of interest have leading white space