How to extract names from unordered data string using regular expression in python?

Time:04-04

I need to extract the names of the people from the following sentence.


Input: BENCH: MAHAJAN, MEHR CHAND BENCH: MAHAJAN, MEHR CHAND DAS, SUDHI RANJAN BOSE, VIVIAN HASAN, GHULAM CITATION: 1953 AIR 28 1953 SCR 197

Output: MEHR CHAND MAHAJAN, MEHR CHAND MAHAJAN, SUDHI RANJAN DAS, VIVIAN BOSE, GHULAM HASAN


For extracting the name from the first part of the sentence, I used the following code.

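import re

# contents holds the input string shown above
bench = re.search(r'BENCH: (.*?) BENCH', contents)
if bench:
    bench = bench.group(1)
    # split on ', ' so the reversed parts join without a stray leading space
    bench = ' '.join(reversed(bench.split(', ')))
    print(bench)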

Output: MEHR CHAND MAHAJAN

CodePudding user response:

You could use this regex to match the names in your input data:

((?:\w+), (?:\w+(?: \w+)?))(?= BENCH:| CITATION:| \w+,)

This looks for a word (\w+), followed by a comma and a space, and then one or two words separated by a space (\w+(?: \w+)?). It then uses a positive lookahead to assert that the match must be followed by one of BENCH:, CITATION: or another word followed by a comma (\w+,). The lookahead is what stops the optional second word from swallowing the next person's surname.

names = re.findall(r'((?:\w+), (?:\w+(?: \w+)?))(?= BENCH:| CITATION:| \w+,)', contents)

For your sample data, this yields:

['MAHAJAN, MEHR CHAND', 'MAHAJAN, MEHR CHAND', 'DAS, SUDHI RANJAN', 'BOSE, VIVIAN', 'HASAN, GHULAM']
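
To make the pieces of the pattern easier to follow, here is a sketch of the same regex written with re.VERBOSE and a comment on each part. The name NAME_PATTERN is only illustrative, and contents is assumed to hold the sample input from the question. In verbose mode unescaped whitespace is ignored, so each literal space is written as [ ].

```python
import re

# Same pattern as above, spelled out with re.VERBOSE.
# Literal spaces are written as "[ ]" because verbose mode ignores bare whitespace.
NAME_PATTERN = re.compile(r"""
    (                       # group 1: one "SURNAME, GIVEN NAMES" entry
        \w+ ,[ ]            # surname, then a comma and a space
        \w+ (?:[ ]\w+)?     # one or two given names
    )
    (?=                     # lookahead: the entry must be followed by ...
        [ ]BENCH:           #   the next BENCH: label,
      | [ ]CITATION:        #   or the CITATION: label,
      | [ ]\w+,             #   or the next surname and its comma
    )
""", re.VERBOSE)

# Sample input from the question
contents = ("BENCH: MAHAJAN, MEHR CHAND BENCH: MAHAJAN, MEHR CHAND "
            "DAS, SUDHI RANJAN BOSE, VIVIAN HASAN, GHULAM "
            "CITATION: 1953 AIR 28 1953 SCR 197")

names = NAME_PATTERN.findall(contents)
print(names)
```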

This list can then be reformatted as you desire:

names = ', '.join(' '.join(reversed(n.split(', '))) for n in names)

Output:

'MEHR CHAND MAHAJAN, MEHR CHAND MAHAJAN, SUDHI RANJAN DAS, VIVIAN BOSE, GHULAM HASAN'
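
Putting both steps together, a minimal end-to-end sketch could look like the following. The function name extract_names is hypothetical, and the code assumes every entry has exactly one surname and one or two given names made of word characters only, as in the sample. Names with initials, hyphens or three or more given names would need a looser pattern.

```python
import re

def extract_names(contents):
    """Return the bench names as one 'GIVEN NAMES SURNAME, ...' string."""
    pattern = r'(\w+, \w+(?: \w+)?)(?= BENCH:| CITATION:| \w+,)'
    entries = re.findall(pattern, contents)
    flipped = []
    for entry in entries:
        # Each entry looks like "SURNAME, GIVEN NAMES"; swap the two halves.
        surname, given = entry.split(', ')
        flipped.append(f'{given} {surname}')
    return ', '.join(flipped)

# Sample input from the question
contents = ("BENCH: MAHAJAN, MEHR CHAND BENCH: MAHAJAN, MEHR CHAND "
            "DAS, SUDHI RANJAN BOSE, VIVIAN HASAN, GHULAM "
            "CITATION: 1953 AIR 28 1953 SCR 197")

print(extract_names(contents))
```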