I want to remove a combination of a space, a hyphen, a space and the text that follows, or the combination "By [Author]". This is my data frame:
my_titles = ['Peter Rabbit - Volume II', 'Who stole my cookie By Cole Pattesh', 'The Stormy Night - Nia Costas']
adf = pd.DataFrame({'my_titles':my_titles})
adf
my_titles
0 Peter Rabbit - Volume II
1 Who stole my cookie By Cole Pattesh
2 The Stormy Night - Nia Costas
My expected df is:
my_titles
0 Peter Rabbit
1 Who stole my cookie
2 The Stormy Night
I have tried this, expecting regex to recognize the '\s' space and the '|' (or):
adf['my_titles'].replace('\s-\s*|\sBy\s*$','',regex=True)
adf
I also tried this, attempting to chain the space and the words:
adf['my_titles'].replace('[ - \w]|[ By \w]','',regex=True)
adf
Please, do you know what I am doing wrong?
CodePudding user response:
You can use
import pandas as pd
my_titles = ['Peter Rabbit - Volume II', 'Who stole my cookie By Cole Pattesh', 'The Stormy Night - Nia Costas']
adf = pd.DataFrame({'my_titles':my_titles})
adf['my_titles'] = adf['my_titles'].str.replace(r'\s+(?:-\s+|By\s+[A-Z]).*', '', regex=True)
Output of print(adf['my_titles']):
0 Peter Rabbit
1 Who stole my cookie
2 The Stormy Night
See the regex demo. Details:
\s+ - one or more whitespaces
(?:-\s+|By\s+[A-Z]) - a - and one or more whitespaces, or By, one or more whitespaces, and an uppercase letter
.* - the rest of the line.
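As for what went wrong in the two attempts from the question, here is a minimal sketch that runs them next to the pattern described above. It uses the standard-library re.sub instead of pandas so it runs without a data frame; this assumes the pandas regex replacement follows the same Python regex rules. The variable names (titles, attempt_1, attempt_2, cleaned) are only for illustration.

```python
import re

titles = [
    'Peter Rabbit - Volume II',
    'Who stole my cookie By Cole Pattesh',
    'The Stormy Night - Nia Costas',
]

# First attempt from the question: only the separator itself is matched,
# so the text after " - " stays, and '\sBy\s*$' requires the string to
# end right after "By" (plus optional whitespace), which never happens here.
attempt_1 = [re.sub(r'\s-\s*|\sBy\s*$', '', t) for t in titles]

# Second attempt: square brackets define a character class (a set of
# single characters), not a sequence, so individual spaces and word
# characters are removed wherever they occur.
attempt_2 = [re.sub(r'[ - \w]|[ By \w]', '', t) for t in titles]

# Pattern from the answer: the separator plus everything after it (.*).
pattern = re.compile(r'\s+(?:-\s+|By\s+[A-Z]).*')
cleaned = [pattern.sub('', t) for t in titles]

print(attempt_1)
print(attempt_2)
print(cleaned)
```

Note also that the code in the question never assigns the result back: Series.replace returns a new Series by default, so printing adf afterwards shows the original titles.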