I have a DataFrame with two columns, full name and last name. Sometimes the last name column is not filled in properly. In those cases, the last name appears in the full name column as the last word enclosed in parentheses. For rows where parentheses are present, I would like to set the last name column to that word.
Code
import pandas as pd
df = pd.DataFrame({
'full':['bob john smith','sam alan (james)','zack joe mac', 'alan (gracie) jacob (arnold)'],
'last': ['ross', '-', 'mac', '-']
})
result_to_be = pd.DataFrame({
'full':['bob john smith','sam alan (james)','zack joe mac', 'alan (gracie) jacob (arnold)'],
'last': ['ross', 'james', 'mac', 'arnold']
})
print(df)
print(result_to_be)
I tried using str.contains to build a mask, but the check fails with a regex error when the pattern is ')' or '(':
df['full'].str.contains(')')
The error it shows is:
re.error: unbalanced parenthesis at position 0
CodePudding user response:
You can use .str.findall to get the values between the parentheses and df.loc to assign the last one where last is '-':
df.loc[df['last'] == '-', 'last'] = df['full'].str.findall(r'\((.+?)\)').str[-1]
Output:
>>> df
                           full    last
0                bob john smith    ross
1              sam alan (james)   james
2                  zack joe mac     mac
3  alan (gracie) jacob (arnold)  arnold
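As for the error in the question: str.contains treats its pattern as a regular expression by default, so a lone ')' is read as an unbalanced group. A minimal sketch, assuming the same df as in the question, that either passes regex=False or escapes the character, and then reuses the resulting mask for the assignment (the names mask_literal, mask_escaped and last_in_parens are only illustrative):

```python
import re
import pandas as pd

df = pd.DataFrame({
    'full': ['bob john smith', 'sam alan (james)', 'zack joe mac', 'alan (gracie) jacob (arnold)'],
    'last': ['ross', '-', 'mac', '-'],
})

# Treat ')' as a literal character instead of a regex pattern.
mask_literal = df['full'].str.contains(')', regex=False)

# Alternative: keep regex matching but escape the parenthesis.
mask_escaped = df['full'].str.contains(re.escape(')'))

# Collect every parenthesised word per row and keep the last one
# (rows without parentheses end up as NaN).
last_in_parens = df['full'].str.findall(r'\((.+?)\)').str[-1]

# Overwrite 'last' only on rows that actually contain parentheses.
df.loc[mask_literal, 'last'] = last_in_parens[mask_literal]
```

Masking on the presence of parentheses rather than on last == '-' follows the wording of the question; with this sample data both conditions select the same rows.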
CodePudding user response:
For a slightly different syntax, you could also use .str.extract:
df.loc[df['last'] == '-', 'last'] = df['full'].str.extract(r'.*\((.*)\)', expand=False)
Output:
                           full    last
0                bob john smith    ross
1              sam alan (james)   james
2                  zack joe mac     mac
3  alan (gracie) jacob (arnold)  arnold
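The extract pattern works because its leading .* is greedy, which pushes the capture group onto the last pair of parentheses. A small sketch to illustrate the difference, using a hypothetical Series named full and a second pattern without the prefix for comparison:

```python
import pandas as pd

full = pd.Series(['sam alan (james)', 'alan (gracie) jacob (arnold)', 'zack joe mac'])

# The greedy leading .* consumes as much as it can, so the group
# captures the LAST parenthesised word in each string.
last_word = full.str.extract(r'.*\((.*)\)', expand=False)

# Without the leading .*, the search stops at the FIRST opening
# parenthesis, so the group captures the first parenthesised word.
first_word = full.str.extract(r'\((.*?)\)', expand=False)

print(pd.DataFrame({'full': full, 'last_word': last_word, 'first_word': first_word}))
```

Rows with no parentheses yield NaN in both cases, which is why the assignment in the answer is restricted to rows where last is '-'.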