Extract names between Academic Degree Variances using Regex Python

Time:02-22

This code has trouble extracting complete names from between academic degrees, for example Dr. Richard, MM or Dr. Bobby Richard Klaus, MM or Richard, MM. The academic degree is not only Dr but can also be Dr., Dra., Prof., Drs, Prof. Dr., M.Ag and ME.

The expected output looks like this:

The Goal Result

Complete Names                  Names (?)
Dr. RICHARD, MM                 Richard
Dra. BOBBY Richard Klaus, MM    Bobby Richard Klaus
Richard, MM                     Richard

but the result I actually get looks like this:

Actual Result

Complete Names                  Names
Dr. Richard, MM                 Richard
Dra. Bobby Richard Klaus, MM    Richard Klaus
Richard, MM                     Richard, MM

with this code

import re

def extract_names(text):
    # fix capitalization
    text = re.sub(r"(_|-)+", " ", text).title()
    # find the name between whitespace and a comma
    text = re.findall(r"\s[A-Z]\w+(?:\s[A-Z]\w+?)?\s(?:[A-Z]\w+?)?[\s\.\,\;\:]", text)
    # text[0] raises IndexError when findall returns an empty list
    text = ' '.join(text[0].split(","))
    return text

There is also another problem; the code raises this error:

     11    text = ' '.join(text[0].split(","))
     12    return text
     13 # def extract_names(text):

IndexError: list index out of range

Answer:

You can use

ads = r'(?:Dr[sa]?|Prof|M\.Ag|M[EM])\.?'
result = re.sub(fr'^(?:\s*{ads})+\s*|\s*,(?:\s*{ads})+$', '', text, flags=re.I)

See the regex demo.

The (?:Dr[sa]?|Prof|M\.Ag|M[EM])\.? pattern matches Dr, Drs, Dra, Prof, M.Ag, ME, or MM, optionally followed by a dot (.).
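
To check which tokens the degree pattern accepts, a small sketch like the one below may help. The helper name is_degree is hypothetical and not part of the answer; it only wraps the ads pattern in re.fullmatch with the same case-insensitive flag.

```python
import re

# Degree pattern from the answer; the trailing dot is optional.
ads = r'(?:Dr[sa]?|Prof|M\.Ag|M[EM])\.?'
DEGREE = re.compile(ads, flags=re.I)

def is_degree(token):
    """Hypothetical helper: True if the whole token is one academic degree."""
    return DEGREE.fullmatch(token) is not None

for token in ["Dr", "Dr.", "Dra.", "Drs", "Prof.", "M.Ag", "ME", "MM", "Richard"]:
    print(token, is_degree(token))
```

Every degree in the list should print True, and Richard should print False.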

The ^(?:\s*{ads})+\s*|\s*,(?:\s*{ads})+$ main pattern matches

  • ^(?:\s*{ads})+\s* - start of string, then one or more sequences of zero or more whitespaces followed by the ads pattern, and then zero or more whitespaces
  • | - or
  • \s*, - zero or more whitespaces and a comma
  • (?:\s*{ads})+ - one or more repetitions of zero or more whitespaces followed by the ads pattern
  • $ - end of string
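
Putting the pieces together, here is a minimal sketch of how the substitution could replace the original findall-based function. The function name extract_name and the final .title() call (to normalize names such as RICHARD) are assumptions made for illustration; the pattern itself is the one described above.

```python
import re

# Degree pattern from the answer; the trailing dot is optional.
ads = r'(?:Dr[sa]?|Prof|M\.Ag|M[EM])\.?'

# Strip leading degrees, or a comma followed by trailing degrees.
PATTERN = re.compile(rf'^(?:\s*{ads})+\s*|\s*,(?:\s*{ads})+$', flags=re.I)

def extract_name(text):
    """Hypothetical helper: remove degrees, then title-case the name."""
    return PATTERN.sub('', text).title()

print(extract_name("Dr. RICHARD, MM"))               # Richard
print(extract_name("Dra. BOBBY Richard Klaus, MM"))  # Bobby Richard Klaus
print(extract_name("Richard, MM"))                   # Richard
print(extract_name("Prof. Dr. Richard, M.Ag"))       # Richard
```

Because re.sub returns the input unchanged when nothing matches, this approach does no list indexing and so cannot raise the IndexError from the question. One caveat: the match is case-insensitive, so a name that itself begins with one of the degree strings (for example Meghan, which starts with ME) would lose its first letters; adding a word boundary (\b) after the degree alternatives is one possible guard.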