Pandas extract in another column the reference name, middle name and surname

Time:11-04

Using pandas in Jupyter, I need to extract into another column the text that follows a colon, for example:

nameis: joe doe, the student is....
nameis: patric test, this question is...
nameis: franck joe and he is.....
nameis: lucash de brown and the academic achievement......

The difficulty is that I need to extract only the name and surname that come right after nameis:, and they are always followed by arbitrary text. The only fixed reference is the recurring nameis: prefix, and I would like to put the name and surname into a dedicated column:

first_last_name,column_2....
joe doe,....
patric test,....
franck joe,......
lucash de brown,.....

Not all names and surnames end with a comma, but in the worst case I am happy to extract only those that do. As a first step, I tried removing the nameis: prefix:

df['column'] = df['column'].str.replace(r'nameis: ', '')

and then something like this, but unfortunately I'm still stuck, especially when dealing with middle names:

pat=r'([nameis:] [a-zA-Z])'
df['first_last_name']=df['column'].str.extract(pat,expand=False)
df

thanks to anyone who helps me!

CodePudding user response:

You can use str.extract and a regex with named capturing groups:

import pandas as pd

df = pd.DataFrame({'column': ['nameis: joe doe, the student is....',
                              'nameis: patric test, this question is...',
                              'nameis: franck joe and he is.....',
                              'nameis: lucash de brown and the academic achievement......']})

df['column'].str.extract(r'nameis: (?P<first_last_name>[^,]+?)(?:,|\s*and) (?P<column_2>.*)')

output:

   first_last_name                        column_2
0          joe doe              the student is....
1      patric test             this question is...
2       franck joe                      he is.....
3  lucash de brown  the academic achievement......
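
Since the question asks for the name in a dedicated column of the same DataFrame, here is a minimal sketch of attaching the extracted groups back to df. It assumes the regex is meant to be a lazy [^,]+? group that stops at the first comma or " and "; the names pattern and parts are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'column': [
    'nameis: joe doe, the student is....',
    'nameis: patric test, this question is...',
    'nameis: franck joe and he is.....',
    'nameis: lucash de brown and the academic achievement......',
]})

# Lazy group: take as little as possible up to the first "," or " and ".
pattern = r'nameis: (?P<first_last_name>[^,]+?)(?:,|\s*and) (?P<column_2>.*)'

# str.extract returns one column per named group, aligned on df's index,
# so join() places the new columns next to the original one.
parts = df['column'].str.extract(pattern)
df = df.join(parts)

print(df[['first_last_name', 'column_2']])
```

Rows that do not match the pattern end up as NaN in both new columns, so they can be filtered afterwards with dropna if needed.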

If you just want the name:

print(df['column'].str.extract(r'nameis: (?P<first_last_name>[^,]+?)(?:,|\s*and)'))

output:

   first_last_name
0          joe doe
1      patric test
2       franck joe
3  lucash de brown
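
One caveat worth noting: a pattern that stops at the first "and " can cut a name that itself ends in "and" (for example "ferdinand"). The sketch below is a possible variant, not part of the answer above: it requires whitespace before "and" and a word boundary after it. The rows marked hypothetical are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'column': [
    'nameis: joe doe, the student is....',
    'nameis: franck joe and he is.....',
    'nameis: ferdinand doe and he is.....',  # hypothetical row: name contains "and"
    'no marker in this row',                 # hypothetical row: no "nameis:" prefix
]})

# \s+and\b only accepts "and" as a separate word, so "ferdinand" is not split.
pattern = r'nameis:\s*(?P<first_last_name>[^,]+?)\s*(?:,|\s+and\b)'

# With a single group and expand=False, str.extract returns a Series.
df['first_last_name'] = df['column'].str.extract(pattern, expand=False)

print(df['first_last_name'])
```

Rows without the nameis: prefix, or without a comma or a standalone "and" after the name, yield NaN. A name that genuinely contains the word "and" would still be cut at that word, so this remains a heuristic.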