Best way to extract first and last names from a sentence in Python (Persian text)

Time:12-18

I have over 20,000 first and last names, and I want to check whether a sentence contains any first name or last name from my dataset. This is my dataset:

l-name   f-name  
میلاد  جورابلو
علی    احمدی
امیر    احمدی

This is the sample sentence:

sentence = 'امروز با میلاد احمدی رفتم بیرون'

The English version of the dataset:

l-name    f-name
Smith     John
Johnson   Anthony
Williams  Ethan

This is the sentence in the English version:

sentence = 'I am going out with John Williams today'

I want my output to be like this:

first_name = ['John']
last_name = ['Williams']

CodePudding user response:

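If you would like to approach this in a naive way, you could consider a regex. However, this relies on the assumption that all first and last names are capitalised, which holds for the English example but not for Persian text, since Persian script has no letter case.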

import re

sentence = 'I am going out with John Williams today'
name = re.search(r"[A-Z][a-z]+ [A-Z][a-z]+", sentence).group()
print(name) # Outputs: John Williams

This will search for a capital letter followed by one or more lower-case letters, then a space, then a repeat of the previous pattern.

Outside of this, you could consider using Named Entity Recognition (NER) using pre-built libraries to identify names in text. Please see here for more details. https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/

Edit:

I should add that in the event where there are multiple names within the same sentence, you can apply re.findall():

sentence = 'I am going out with John Williams and William Smith today'
names = re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", sentence)
print(names) # Outputs: ['John Williams', 'William Smith']
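Since the capitalisation trick cannot work for Persian, one possible variation is to build the pattern from the known names instead. The following is only a sketch under assumptions: the two short lists are hypothetical stand-ins for the 20,000-name dataset, `build_pattern` is a helper name made up for this example, and every name is assumed to be a single word. With a very large name list the alternation becomes long, so a plain set lookup may turn out to be faster.

```python
import re

# Hypothetical sample lists standing in for the real dataset.
first_names = ['میلاد', 'علی', 'امیر']
last_names = ['جورابلو', 'احمدی']

def build_pattern(names):
    # Try longer names first so a longer name wins over a shorter prefix of it.
    alternatives = sorted(set(names), key=len, reverse=True)
    body = '|'.join(re.escape(n) for n in alternatives)
    # The lookarounds require that the match is not part of a longer word.
    return re.compile(r'(?<!\w)(?:' + body + r')(?!\w)')

first_re = build_pattern(first_names)
last_re = build_pattern(last_names)

sentence = 'امروز با میلاد احمدی رفتم بیرون'
first_name = first_re.findall(sentence)
last_name = last_re.findall(sentence)
print(first_name)
print(last_name)
```

In Python 3, `\w` is Unicode-aware for `str` patterns, so the word-boundary lookarounds also apply to Persian letters. Matches are returned in the order they occur in the sentence.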

CodePudding user response:

Just get lists of names from each column and check if a string contains any element from those lists.

import pandas as pd 

names = [['John', 'Smith'], ['Anthony', 'Johnson'], ['Ethan', 'Williams']]
df = pd.DataFrame(names, columns = ['f_name', 'l_name'])


fname_list = df['f_name'].to_list()
lname_list = df['l_name'].to_list()

sentence = 'I am going out with John Williams today'
sentence = sentence.split()

fname_exist = [e for e in sentence if(e in fname_list)]
lname_exist = [e for e in sentence if(e in lname_list)]

if len(fname_exist) > 0 and len(lname_exist) > 0:
    print('first name: ' + fname_exist[0])
    print('last name: ' + lname_exist[0])

Output:

first name: John
last name: Williams
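For a dataset of 20,000 names, the same idea can be sketched with sets, which give constant-time membership tests, and it can return the lists the question asks for. This is a minimal sketch, not a definitive implementation: `extract_names` is a hypothetical helper name, the punctuation list is an assumption, and names are assumed to be single whitespace-separated tokens.

```python
def extract_names(sentence, first_names, last_names):
    """Return the tokens of `sentence` found in each name collection."""
    first_set = set(first_names)  # sets make each lookup O(1) on average
    last_set = set(last_names)
    # Strip common Latin and Persian punctuation so 'Williams,' still matches.
    tokens = [t.strip(".,;:!?()'\"،؛؟") for t in sentence.split()]
    found_first = [t for t in tokens if t in first_set]
    found_last = [t for t in tokens if t in last_set]
    return found_first, found_last

first_name, last_name = extract_names(
    'I am going out with John Williams today',
    ['John', 'Anthony', 'Ethan'],
    ['Smith', 'Johnson', 'Williams'],
)
print(first_name)  # ['John']
print(last_name)   # ['Williams']
```

Because the matching is exact token comparison, it does not depend on capitalisation and works the same way for the Persian sentence. A token that appears in both collections would be reported in both lists.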