Replace combination of space, hyphen and text or a "by" using regex and pandas

Time:03-20

I want to remove a combination of a space, a hyphen, a space and the text that follows, or the combination "By [Author]". This is my data frame:

import pandas as pd

my_titles = ['Peter Rabbit - Volume II', 'Who stole my cookie  By Cole Pattesh', 'The Stormy Night -  Nia Costas']
adf = pd.DataFrame({'my_titles':my_titles})
adf
    my_titles
0   Peter Rabbit - Volume II
1   Who stole my cookie By Cole Pattesh
2   The Stormy Night - Nia Costas

My expected df is:

    my_titles
0   Peter Rabbit
1   Who stole my cookie
2   The Stormy Night

I have tried this, expecting regex to recognize the '\s' space and the '|' (or):

adf['my_titles'].replace('\s-\s*|\sBy\s*$','',regex=True)
adf

I also tried this, attempting to chain the space and the words:

adf['my_titles'].replace('[ - \w]|[ By \w]','',regex=True)
adf

Please, do you know what I am doing wrong?

CodePudding user response:

You can use

import pandas as pd
my_titles = ['Peter Rabbit - Volume II', 'Who stole my cookie  By Cole Pattesh', 'The Stormy Night -  Nia Costas']
adf = pd.DataFrame({'my_titles':my_titles})
adf['my_titles'] = adf['my_titles'].str.replace(r'\s+(?:-\s+|By\s+[A-Z]).*', '', regex=True)

Output of print(adf['my_titles']):

0           Peter Rabbit
1    Who stole my cookie
2       The Stormy Night
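
As a side note on why the attempts in the question left adf unchanged: Series.replace returns a new Series by default, so the result has to be assigned back to the column. The attempted patterns also did not consume the trailing text with .*. A minimal sketch of the assignment point, reusing the pattern from this answer (the names pattern, unchanged and changed are only illustrative):

```python
import pandas as pd

adf = pd.DataFrame({'my_titles': ['Peter Rabbit - Volume II',
                                  'Who stole my cookie  By Cole Pattesh',
                                  'The Stormy Night -  Nia Costas']})

pattern = r'\s+(?:-\s+|By\s+[A-Z]).*'

# The returned Series is discarded here, so the column stays as it was.
adf['my_titles'].replace(pattern, '', regex=True)
unchanged = adf['my_titles'].tolist()

# Assigning the returned Series back is what actually updates the frame.
adf['my_titles'] = adf['my_titles'].replace(pattern, '', regex=True)
changed = adf['my_titles'].tolist()

print(unchanged)
print(changed)
```

Either Series.replace(..., regex=True) or Series.str.replace(..., regex=True) works here, as long as the result is stored.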

Regex details:

  • \s+ - one or more whitespaces
  • (?:-\s+|By\s+[A-Z]) - a hyphen and one or more whitespaces, or By, one or more whitespaces, and an uppercase letter
  • .* - the rest of the line.
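
To check how each part of the pattern behaves, here is a small sketch using Python's built-in re module with the same pattern. The last three titles are hypothetical and not part of the original data; they are added only to probe the limits of the pattern.

```python
import re

# Same pattern as above: whitespace, then "-" plus whitespace
# or "By" plus whitespace plus an uppercase letter, then the rest of the line.
PATTERN = re.compile(r'\s+(?:-\s+|By\s+[A-Z]).*')

titles = [
    'Peter Rabbit - Volume II',
    'Who stole my cookie  By Cole Pattesh',
    'The Stormy Night -  Nia Costas',
    'Well-known tales',       # hypothetical: hyphen without surrounding spaces
    'Passing by the river',   # hypothetical: lowercase "by"
    'Stand By Me',            # hypothetical: "By" that is part of the title
]

cleaned = [PATTERN.sub('', title) for title in titles]

for before, after in zip(titles, cleaned):
    print(f'{before!r} -> {after!r}')
```

A hyphen inside a word and a lowercase "by" are left alone, but a title such as 'Stand By Me' is cut down to 'Stand', so the pattern assumes that " By " followed by a capital letter always introduces an author.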