I'm searching for exact course codes in a text. Codes look like this:
MAT1051
CMP1401*
PHY1001*
MAT1041*
ENG1003*
So a code is 3 or 4 uppercase letters followed by 4 digits.
I only want the ones that do not end with a "*" symbol.
I have tried
course_code = re.compile('[A-Z]{4}[0-9]{4}|[A-Z]{3}[0-9]{4}')
which is probably one of the worst ways to do it, but it kind of works, as I can get all the courses listed above. The issue is that I don't want the 4 course codes ending with a "*" (failed courses have a * next to their codes) to be included in the list.
I tried adding \w or $ to the end of the expression. Whichever I add, the code returns an empty list.
CodePudding user response:
If I read your requirements correctly, you want this pattern:
^[A-Z]{3,4}[0-9]{4}$
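This assumes that you are searching your entire text, stored in a Python string, with the regex in multiline mode. In that mode ^ and $ match at the start and end of every line rather than only at the start and end of the whole string, so a code followed by "*" can never satisfy the $. See this demo: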
import re

inp = """MAT1051
CMP1401*
PHY1001*
MAT1041*
ENG1003*"""
matches = re.findall(r'^[A-Z]{3,4}[0-9]{4}$', inp, flags=re.M)
print(matches) # ['MAT1051']
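If the codes are not one per line but sit inside running text, the ^ and $ anchors will not help. A possible variation, which is a different technique from the anchors above, is to use word boundaries plus a negative lookahead for the "*". This is only a sketch: the sample sentence and the names text, pattern and passed are made up for illustration.

```python
import re

# Hypothetical sample: codes embedded in a sentence instead of one per line.
text = "Passed MAT1051 and COMP2002, failed CMP1401* and PHY1001*."

# \b keeps the match on whole words; (?!\*) rejects a code that is
# immediately followed by "*".
pattern = re.compile(r'\b[A-Z]{3,4}[0-9]{4}\b(?!\*)')

passed = pattern.findall(text)
print(passed)  # ['MAT1051', 'COMP2002']
```

The lookahead does not consume the "*", it only checks that the next character is not one, so the returned strings are the bare codes.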
CodePudding user response:
import re
# Add a "$" at the end of each alternative.
# It requires the match to end after the 4 digits.
course_code = re.compile(r'[A-Z]{4}[0-9]{4}$|[A-Z]{3}[0-9]{4}$')
# No match here: the "*" after the digits stops "$" from matching
m = course_code.match("MAT1051*")
print(m)  # None
# This matches
m = course_code.match("MAT1051")
print(m)  # <re.Match object; span=(0, 7), match='MAT1051'>
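One thing to be aware of: "$" also matches just before a trailing newline, so "MAT1051\n" would still pass the check above. If each candidate is already a separate string, fullmatch may be a simpler option, since it needs no anchors at all. A minimal sketch, assuming the candidates come as a list (the helper is_passed_code and the list candidates are hypothetical names):

```python
import re

# No anchors needed: fullmatch only succeeds if the whole string is a code.
course_code = re.compile(r'[A-Z]{3,4}[0-9]{4}')

def is_passed_code(s):
    """Return True if s is exactly a course code with no trailing "*"."""
    return course_code.fullmatch(s) is not None

# Hypothetical candidates, e.g. obtained by splitting the text on whitespace.
candidates = ["MAT1051", "CMP1401*", "COMP2002", "MAT1051\n"]
passed = [c for c in candidates if is_passed_code(c)]
print(passed)  # ['MAT1051', 'COMP2002']
```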