I'm trying to write a regex that fits the following criteria:
- CS 1110: "Introduction to Programming"
- ENGR 1624: "Introduction to Engineering"
- BME 2220: "Biomechanics"
should all match.
- CS 20: "Introduction to CS"
- ENGR 1624: " "
- ENGR 1624: ""
should not match.
This is my code so far:
([A-Z]{2,4})\s([1000-4000]{4})(:)\s(["][a-zA-Z]*\s[a-zA-Z]*?\s[a-zA-Z]*["])
However I'm running into two problems:
- When I try to run ENGR 1624, it is not working (I assume because the [1000-4000]{4} part of my code is wrong)
- It will not work for just the one word "Biomechanics"
Can anyone help fix my code please???
CodePudding user response:
If you don't want to match an empty string between the double quotes in the last group, you can repeat the character class 1 or more times with [a-zA-Z]+
and optionally repeat a group that starts with a whitespace character followed again by the character class: (?:\s[a-zA-Z]+)*
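As a minimal sketch of that quoted-title part on its own, assuming Python's re module (the name TITLE is hypothetical and only used for illustration), you can check it against the titles from the question:

```python
import re

# Quoted title: one or more letters, then zero or more groups made of
# a single whitespace character followed by one or more letters.
TITLE = re.compile(r'"[a-zA-Z]+(?:\s[a-zA-Z]+)*"')

for candidate in ['"Biomechanics"', '"Introduction to Programming"', '" "', '""']:
    # fullmatch requires the whole candidate string to fit the pattern
    print(repr(candidate), bool(TITLE.fullmatch(candidate)))
```

The single word and the three-word title pass, while the empty and whitespace-only titles are rejected because the first [a-zA-Z]+ needs at least one letter right after the opening quote.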
About the notation in the pattern: the " does not have to be between square brackets. The character class notation [1000-4000]{4} is not a numeric range. It repeats 4 times a single character that is either 1, 0, or in the range 0-4, so it matches any four digits from 0 to 4. That is why 1624 fails: 6 is not in the class.
A range from 1000 to 4000 can be written as (?:4000|[1-3][0-9]{3}), which matches either 4000 or a number from 1000 to 3999.
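To see the difference, here is a small sketch, assuming Python's re module, that compares the two notations on a few sample numbers (the variable names char_class and number_range are made up for this example):

```python
import re

# The character class from the question: four characters, each one of 0-4
char_class = re.compile(r'[1000-4000]{4}')
# An actual 1000-4000 range: either 4000, or 1-3 followed by three digits
number_range = re.compile(r'(?:4000|[1-3][0-9]{3})')

for text in ['1110', '2220', '1624', '0000', '4000', '4001', '999']:
    print(text,
          bool(char_class.fullmatch(text)),
          bool(number_range.fullmatch(text)))
```

The character class accepts 0000 and 4001 but rejects 1624, while the alternation accepts exactly the four-digit numbers from 1000 to 4000.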
You might update the pattern using 3 capture groups instead:
\b([A-Z]{2,4})\s(4000|[1-3][0-9]{3}):\s("[a-zA-Z]+(?:\s[a-zA-Z]+)*")
For example
import re
pattern = r'\b([A-Z]{2,4})\s(4000|[1-3][0-9]{3}):\s("[a-zA-Z]+(?:\s[a-zA-Z]+)*")'
s = ("CS 1110: \"Introduction to Programming\", ENGR 1624: \"Introduction to\n"
"Engineering\", and BME 2220: \"Biomechanics\"\n\n"
"CS 20: \"Introduction to CS\", ENGR 1624: \" \", and ENGR 1624: \"\"")
print(re.findall(pattern, s))
Output
[('CS', '1110', '"Introduction to Programming"'), ('ENGR', '1624', '"Introduction to\nEngineering"'), ('BME', '2220', '"Biomechanics"')]
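If each course entry is a separate string rather than part of a longer text, one option is to validate the whole string with re.fullmatch instead of searching with re.findall. This is only a sketch under that assumption; the names COURSE and is_valid_course are hypothetical helpers, not part of the answer above.

```python
import re

# Same three capture groups as above: department, number, quoted title
COURSE = re.compile(r'([A-Z]{2,4})\s(4000|[1-3][0-9]{3}):\s("[a-zA-Z]+(?:\s[a-zA-Z]+)*")')

def is_valid_course(entry: str) -> bool:
    """Return True only if the entire entry fits the course pattern."""
    return COURSE.fullmatch(entry) is not None

entries = [
    'CS 1110: "Introduction to Programming"',
    'ENGR 1624: "Introduction to Engineering"',
    'BME 2220: "Biomechanics"',
    'CS 20: "Introduction to CS"',
    'ENGR 1624: " "',
    'ENGR 1624: ""',
]
for entry in entries:
    print(entry, is_valid_course(entry))
```

With fullmatch the \b at the start is not needed, because the pattern has to cover the string from its first to its last character.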