I have over 20,000 first and last names, and I want to check whether a sentence contains any first name or last name from my dataset. This is my dataset:
l-name f-name
میلاد جورابلو
علی احمدی
امیر احمدی
This is the sample sentence:
sentence = 'امروز با میلاد احمدی رفتم بیرون'
The English version of the dataset:
l-name f-name
Smith John
Johnson Anthony
Williams Ethan
This is the sentence in the English version:
sentence = 'I am going out with John Williams today'
I want my output to look like this:
first_name = ['John']
last_name = ['Williams']
CodePudding user response:
If you would like to approach this in a naive way, you could consider regex. However, this relies on the assumption that all first and last names are capitalised.
import re

sentence = 'I am going out with John Williams today'
# Two capitalised words separated by a single space
name = re.search(r"[A-Z][a-z]+ [A-Z][a-z]+", sentence).group()
print(name) # Outputs: John Williams
This will search for a capital letter followed by one or more lower-case letters, then a space, then a repeat of the previous pattern.
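Note that re.search() returns None when nothing matches, so calling .group() directly raises an AttributeError on a sentence without a capitalised pair. A minimal sketch of a guard follows; NAME_PATTERN and find_full_name are hypothetical names used only for illustration:

```python
import re

# Hypothetical helper names; the pattern assumes capitalised ASCII names.
NAME_PATTERN = re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+")

def find_full_name(sentence):
    """Return the first 'Capitalised Capitalised' pair, or None if absent."""
    match = NAME_PATTERN.search(sentence)
    return match.group() if match else None

print(find_full_name('I am going out with John Williams today'))  # John Williams
print(find_full_name('I am going out today'))                     # None
```

The pattern still matches any two adjacent capitalised words, so it can return false positives such as the first word of a sentence followed by a name.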
Beyond this, you could consider Named Entity Recognition (NER) with a pre-built library to identify names in text. Please see here for more details: https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/
Edit:
I should add that when there are multiple names within the same sentence, you can apply re.findall():
import re

sentence = 'I am going out with John Williams and William Smith today'
names = re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", sentence)
print(names) # Outputs: ['John Williams', 'William Smith']
CodePudding user response:
Just get lists of names from each column and check if a string contains any element from those lists.
import pandas as pd
names = [['John', 'Smith'], ['Anthony', 'Johnson'], ['Ethan', 'Williams']]
df = pd.DataFrame(names, columns = ['f_name', 'l_name'])
fname_list = df['f_name'].to_list()
lname_list = df['l_name'].to_list()
sentence = 'I am going out with John Williams today'
words = sentence.split()
# Keep the words that appear in each name list
fname_exist = [w for w in words if w in fname_list]
lname_exist = [w for w in words if w in lname_list]
if fname_exist and lname_exist:
    print('first name: ' + fname_exist[0])
    print('last name: ' + lname_exist[0])
Output:
first name: John
last name: Williams
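With over 20,000 names, membership tests against a list scan the whole list for every word, so it may be worth converting each column to a set first (for example set(df['f_name'])). The following is a rough sketch of that idea rather than a definitive implementation. The find_names function is a hypothetical helper, the sets are typed in by hand instead of being read from the DataFrame, and tokenising with \w+ instead of split() is an assumption made so that trailing punctuation does not hide a name:

```python
import re

def find_names(sentence, first_names, last_names):
    """Return (first_name, last_name) lists of the words found in each set."""
    # \w+ extracts word tokens and drops punctuation such as a trailing full stop
    words = re.findall(r"\w+", sentence)
    first_name = [w for w in words if w in first_names]
    last_name = [w for w in words if w in last_names]
    return first_name, last_name

# Hand-written stand-ins for the two dataset columns
first_names = {'John', 'Anthony', 'Ethan'}
last_names = {'Smith', 'Johnson', 'Williams'}

first_name, last_name = find_names(
    'I am going out with John Williams today.', first_names, last_names
)
print(first_name)  # ['John']
print(last_name)   # ['Williams']
```

This returns every match as a list, which is the shape asked for in the question. Because \w matches Unicode letters for str patterns in Python 3 and no capitalisation is assumed, the same function should also apply to the Persian dataset, provided the names in the sentence are spelled exactly as in the dataset.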