I need to extract the names of the people from the following sentence.
Input: BENCH: MAHAJAN, MEHR CHAND BENCH: MAHAJAN, MEHR CHAND DAS, SUDHI RANJAN BOSE, VIVIAN HASAN, GHULAM CITATION: 1953 AIR 28 1953 SCR 197
Output: MEHR CHAND MAHAJAN, MEHR CHAND MAHAJAN, SUDHI RANJAN DAS, VIVIAN BOSE, GHULAM HASAN
To extract the name from the first part of the sentence, I used the following code:
import re

bench = re.search('BENCH: (.*?) BENCH', contents)
if bench:
    bench = bench.group(1)
    # Split on ", " so the given names carry no leading space.
    bench = ' '.join(reversed(bench.split(", ")))
    print(bench)
Output: MEHR CHAND MAHAJAN
CodePudding user response:
You could use this regex to match the names in your input data:
((?:\w+), (?:\w+(?: \w+)?))(?= BENCH:| CITATION:| \w+,)
This looks for a word (\w+), followed by a comma and a space and then one or two words separated by a space (\w+(?: \w+)?), and then uses a positive lookahead to assert that those words must be followed by one of " BENCH:", " CITATION:" or another word followed by a comma ( \w+,).
names = re.findall(r'((?:\w+), (?:\w+(?: \w+)?))(?= BENCH:| CITATION:| \w+,)', contents)
For your sample data, this yields:
['MAHAJAN, MEHR CHAND', 'MAHAJAN, MEHR CHAND', 'DAS, SUDHI RANJAN', 'BOSE, VIVIAN', 'HASAN, GHULAM']
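If the one-line pattern is hard to read, the same expression can be spelled out in verbose mode. This is only a sketch of how that might look: the name NAME_RE is mine, and the sample string is the input from the question. Whitespace inside a verbose pattern is ignored, so every literal space is written as [ ].

```python
import re

# Sample line from the question.
contents = ("BENCH: MAHAJAN, MEHR CHAND BENCH: MAHAJAN, MEHR CHAND "
            "DAS, SUDHI RANJAN BOSE, VIVIAN HASAN, GHULAM "
            "CITATION: 1953 AIR 28 1953 SCR 197")

# NAME_RE is a hypothetical name; the pattern is the one from this answer,
# laid out with re.VERBOSE so each part can carry a comment.
NAME_RE = re.compile(r"""
    (                       # group 1: a single SURNAME, GIVEN NAMES entry
        \w+                 #   surname
        ,[ ]                #   comma and space
        \w+ (?:[ ]\w+)?     #   one or two given names
    )
    (?= [ ]BENCH:           # followed by the next BENCH: label
      | [ ]CITATION:        # or by the CITATION: label
      | [ ]\w+,             # or by the surname of the next entry
    )
""", re.VERBOSE)

# With one capturing group, findall returns the text of that group.
names = NAME_RE.findall(contents)
print(names)
```

The lookahead is what stops the optional second given name from swallowing the next surname: after "BOSE, VIVIAN" the engine first tries to take " HASAN" as well, finds that what follows is not one of the three allowed continuations, and backtracks.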
This list can then be reformatted as you desire:
names = ', '.join(' '.join(n.split(', ')[::-1]) for n in names)
Output:
'MEHR CHAND MAHAJAN, MEHR CHAND MAHAJAN, SUDHI RANJAN DAS, VIVIAN BOSE, GHULAM HASAN'
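Putting the two steps together, a small helper could look like the following. This is a sketch under the assumption that every entry has exactly one surname followed by one or two given names, as in the sample; the function name extract_names is made up for illustration.

```python
import re

# Pattern from this answer: "SURNAME, GIVEN [GIVEN]" followed by a label
# or by the next surname.
NAME_RE = re.compile(r'((?:\w+), (?:\w+(?: \w+)?))(?= BENCH:| CITATION:| \w+,)')


def extract_names(contents):
    """Return each name as 'GIVEN NAMES SURNAME', in order of appearance."""
    result = []
    for entry in NAME_RE.findall(contents):
        # Each entry contains exactly one ", " separating surname and given names.
        surname, given = entry.split(', ', 1)
        result.append(f'{given} {surname}')
    return result


sample = ("BENCH: MAHAJAN, MEHR CHAND BENCH: MAHAJAN, MEHR CHAND "
          "DAS, SUDHI RANJAN BOSE, VIVIAN HASAN, GHULAM "
          "CITATION: 1953 AIR 28 1953 SCR 197")

print(', '.join(extract_names(sample)))
```

Returning a list rather than a joined string leaves it to the caller to decide how to present the names. Names with more than two given names, or with hyphens or initials, would need a looser pattern than this one.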