This code has trouble extracting complete names from between academic degrees, for example, Dr. Richard, MM or Dr. Bobby Richard Klaus, MM or Richard, MM. The academic degrees are not only Dr but also Dr., Dra., Prof., Drs, Prof. Dr., M.Ag and ME.
The output should look like this:
The Goal Result
Complete Names | Names (?) |
---|---|
Dr. RICHARD, MM | Richard |
Dra. BOBBY Richard Klaus, MM | Bobby Richard Klaus |
Richard, MM | Richard |
but the actual result looks like this:
Actual Result
Complete Names | Names |
---|---|
Dr. Richard, MM | Richard |
Dra. Bobby Richard Klaus, MM | Richard Klaus |
Richard, MM | Richard, MM |
with this code:

    import re

    def extract_names(text):
        # Normalize separators and capitalization
        text = re.sub(r"(_|-)+", " ", text).title()
        # Find a name between whitespace and a comma
        text = re.findall(r"\s[A-Z]\w+(?:\s[A-Z]\w+?)?\s(?:[A-Z]\w+?)?[\s\.\,\;\:]", text)
        text = ' '.join(text[0].split(","))
        return text

Then there is another problem, this error:

         11     text = ' '.join(text[0].split(","))
         12     return text
         13 # def extract_names(text):
    IndexError: list index out of range

Answer:
You can use

    import re

    ads = r'(?:Dr[sa]?|Prof|M\.Ag|M[EM])\.?'
    result = re.sub(fr'^(?:\s*{ads})+\s*|\s*,(?:\s*{ads})+$', '', text, flags=re.I)

The `(?:Dr[sa]?|Prof|M\.Ag|M[EM])\.?` pattern matches `Dr`, `Drs`, `Dra`, `Prof`, `M.Ag`, `ME`, or `MM`, optionally followed by a `.`.
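
To see which tokens the title pattern accepts on its own, a small check like the following may help. The helper name `is_degree` and the token list are illustrative assumptions, not part of the answer; only the `ads` pattern is taken from it.

```python
import re

# Title/degree pattern copied from the answer.
ads = r'(?:Dr[sa]?|Prof|M\.Ag|M[EM])\.?'

def is_degree(token):
    """Hypothetical helper: True if the whole token is one of the listed titles."""
    # fullmatch requires the pattern to cover the entire token;
    # re.I makes the comparison case-insensitive, as in the answer.
    return re.fullmatch(ads, token, flags=re.I) is not None

for token in ["Dr", "Dr.", "Dra.", "Drs", "Prof.", "M.Ag", "ME", "MM", "Richard"]:
    print(token, is_degree(token))
```

Every listed title should print `True`, while an ordinary name such as `Richard` should print `False`.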
The main pattern, `^(?:\s*{ads})+\s*|\s*,(?:\s*{ads})+$`, matches:
- `^(?:\s*{ads})+\s*` - start of string, then one or more sequences of zero or more whitespaces and the `ads` pattern, and then zero or more whitespaces
- `|` - or
- `\s*,` - zero or more whitespaces and a comma
- `(?:\s*{ads})+` - one or more repetitions of zero or more whitespaces and the `ads` pattern
- `$` - end of string
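
Putting the pieces together, here is a minimal sketch that applies the pattern to the sample names from the question and title-cases the result. The names `build_pattern`, `extract_name`, `loose`, `strict`, and `ads_strict` are assumptions made for illustration. The stricter variant is also not from the answer: as far as I can tell, nothing in the original `ads` pattern forces a title to end at a word boundary, so a name that happens to start with a title (for example `Drake`) would be truncated, and `ads_strict` is one possible way to guard against that.

```python
import re

# Title/degree pattern from the answer.
ads = r'(?:Dr[sa]?|Prof|M\.Ag|M[EM])\.?'
# Hypothetical stricter variant (an assumption, not from the answer):
# the title must be followed by a dot or a word boundary.
ads_strict = r'(?:Dr[sa]?|Prof|M\.Ag|M[EM])(?:\.|\b)'

def build_pattern(title_pattern):
    """Compile the answer's main pattern around a given title pattern."""
    return re.compile(
        fr'^(?:\s*{title_pattern})+\s*|\s*,(?:\s*{title_pattern})+$',
        flags=re.I,
    )

def extract_name(text, pattern):
    """Strip leading titles and trailing degrees, then title-case the name."""
    return pattern.sub('', text).title()

loose = build_pattern(ads)
strict = build_pattern(ads_strict)

samples = [
    "Dr. RICHARD, MM",
    "Dra. BOBBY Richard Klaus, MM",
    "Richard, MM",
    "Prof. Dr. Richard, M.Ag",
]
for s in samples:
    print(f"{s!r} -> {extract_name(s, loose)!r}")

# A name that begins with a title-like prefix shows the difference:
print(extract_name("Drake, MM", loose))   # prints "Ke"
print(extract_name("Drake, MM", strict))  # prints "Drake"
```

Because `re.sub` returns the input unchanged when nothing matches, this approach also avoids the `IndexError` that comes from indexing an empty `re.findall` result.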