Best way to remove specific words from column in pandas dataframe?-CodePudding

I'm working with a huge set of data that I can't work with in excel so I'm using Pandas/Python, but I'm relatively new to it. I have this column of book titles that also include genres, both before and after the title. I only want the column to contain book titles, so what would be the easiest way to remove the genres?

Here is an example of what the column contains:

Book Labels
Science Fiction | Drama | Dune
Thriller | Mystery | The Day I Died
Thriller | Razorblade Tears | Family | Drama
Comedy | How To Marry Keanu Reeves In 90 Days | Drama
...

So above, the book titles would be Dune, The Day I Died, Razorblade Tears, and How To Marry Keanu Reeves In 90 Days, but as you can see the genres precede as well as succeed the titles.

I was thinking I could create a list of all the genres (as there are only so many) and remove those from the column along with the "|" characters, but if anyone has suggestions on a simpler way to remove the genres and "|" key, please help me out.

CodePudding user response：

It is an enhancement to @tdy Regex solution. The original regex Family|Drama will match the words "Family" and "Drama" in the string. If the book title contains the words in gernes, the words will be removed as well.

Supposed that the labels are separated by " | ", there are three match conditions we want to remove.

Gerne at start of string. e.g. Drama | ...
Gerne in the middle. e.g. ... | Drama | ...
Gerne at end of string. e.g. ... | Drama

Use regex (^|\| )(?:Family|Drama)(?=( \||$)) to match one of three conditions. Note that | Drama | Family has 2 overlapped matches, here I use ?=( \||$) to avoid matching once only. See this problem [Use regular expressions to replace overlapping subpatterns] for more details.

>>> genres = ["Family", "Drama"]

>>> df

#                       Book Labels
# 0      Drama | Drama 123 | Family
# 1      Drama 123 | Drama | Family
# 2      Drama | Family | Drama 123
# 3  123 Drama 123 | Family | Drama
# 4      Drama | Family | 123 Drama

>>> re_str = "(^|\| )(?:{})(?=( \||$))".format("|".join(genres))

>>> df['Book Labels'] = df['Book Labels'].str.replace(re_str, "", regex=True)

# 0       | Drama 123
# 1        Drama 123
# 2        | Drama 123
# 3    123 Drama 123
# 4        | 123 Drama

>>> df["Book Labels"] = df["Book Labels"].str.strip("| ")

# 0        Drama 123
# 1        Drama 123
# 2        Drama 123
# 3    123 Drama 123
# 4        123 Drama

CodePudding user response：

Since your sample shows that the title is not in a consistent location, I'd go with your idea:

create a list of all the genres (as there are only so many) and remove those from the column along with the "|" characters

Use Series.replace to remove the genres and Series.str.strip to strip the separators:

genres = ['Science Fiction', 'Drama', 'Thriller', 'Mystery', 'Family', 'Comedy']
df['Book Labels'] = df['Book Labels'].replace('|'.join(genres), '', regex=True).str.strip('| ')

#                             Book Labels
# 0                                  Dune
# 1                        The Day I Died
# 2                      Razorblade Tears
# 3  How To Marry Keanu Reeves In 90 Days

If the title is always in a consistent location, say 3rd in the list, then we would not need the list of genres. We could use Series.str.split with expand=True and get the 3rd column (index 2):

# only works if the title is always in position 3 (index 2)
index = 2
df['Book Labels'] = df['Book Labels'].str.split('|', expand=True)[index]