Using Jupyter with pandas, I need to extract into another column the name that appears right after a colon, for example:
nameis: joe doe, the student is....
nameis: patric test, this question is...
nameis: franck joe and he is.....
nameis: lucash de brown and the academic achievement......
The problem gets complicated for me because I have to extract only the name and surname that come right after nameis:, and they are followed by arbitrary text. The only fixed reference is the recurring nameis: prefix, and I would like to put the name and surname in a separate, dedicated column:
first_last_name,column_2....
joe doe,....
patric test,....
franck joe,......
lucash de brown,.....
Not all names and surnames end with a comma, but if necessary I am happy to extract only those that do. In the meantime, I thought of stripping the nameis: prefix first:
df['column'] = df['column'].str.replace(r'nameis: ', '')
and then trying something like the following, but unfortunately I'm still stuck, especially when dealing with middle names:
pat=r'([nameis:] [a-zA-Z])'
df['first_last_name']=df['column'].str.extract(pat,expand=False)
df
thanks to anyone who helps me!
CodePudding user response:
You can use str.extract
and a regex with named capturing groups:
df = pd.DataFrame({'column': ['nameis: joe doe, the student is....',
'nameis: patric test, this question is...',
'nameis: franck joe and he is.....',
'nameis: lucash de brown and the academic achievement......']})
df['column'].str.extract(r'nameis: (?P<first_last_name>[^,]+?)(?:,|\s*and) (?P<column_2>.*)')
output:
   first_last_name                        column_2
0          joe doe              the student is....
1      patric test             this question is...
2       franck joe                      he is.....
3  lucash de brown  the academic achievement......
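To get these as dedicated columns next to the original text, the frame returned by str.extract can be joined back onto df. The following is a minimal sketch under that assumption; the variable names pattern, extracted and result are only illustrative and do not come from the question.

```python
import pandas as pd

df = pd.DataFrame({'column': [
    'nameis: joe doe, the student is....',
    'nameis: patric test, this question is...',
    'nameis: franck joe and he is.....',
    'nameis: lucash de brown and the academic achievement......',
]})

# Lazy [^,]+? stops at the first comma or at the word "and";
# the named groups become the column names of the extracted frame.
pattern = r'nameis: (?P<first_last_name>[^,]+?)(?:,|\s*and) (?P<column_2>.*)'

extracted = df['column'].str.extract(pattern)

# join aligns on the index, so each row keeps its original text
result = df.join(extracted)
print(result[['first_last_name', 'column_2']])
```

Rows that do not match the pattern would get NaN in both new columns.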
If you just want the name:
print(df['column'].str.extract(r'nameis: (?P<first_last_name>[^,]+?)(?:,|\s*and)'))
output:
   first_last_name
0          joe doe
1      patric test
2       franck joe
3  lucash de brown
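One caveat: with \s*and the pattern can cut a name short if the name itself contains the letters "and" (for example "sandra"). A possible variation, offered here as an assumption rather than as part of the approach above, is to require whitespace before "and" and a word boundary after it, and to assign the result directly to a new column with expand=False. The sample rows below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, including a name that contains "and"
# and a row without the "nameis:" prefix.
df = pd.DataFrame({'column': [
    'nameis: joe doe, the student is....',
    'nameis: sandra smith and she is...',
    'no name here',
]})

# \s+and\b only treats "and" as a delimiter when it is a separate word
pattern = r'nameis: (?P<first_last_name>[^,]+?)(?:,|\s+and\b)'

# expand=False returns a Series when there is a single capture group
df['first_last_name'] = df['column'].str.extract(pattern, expand=False)
print(df)
```

Rows where the pattern does not match are left as NaN, so they can be dropped or handled separately afterwards.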