I'm working with a huge set of data that I can't work with in excel so I'm using Pandas/Python, but I'm relatively new to it. I have this column of book titles that also include genres, both before and after the title. I only want the column to contain book titles, so what would be the easiest way to remove the genres?
Here is an example of what the column contains:
Book Labels
Science Fiction | Drama | Dune
Thriller | Mystery | The Day I Died
Thriller | Razorblade Tears | Family | Drama
Comedy | How To Marry Keanu Reeves In 90 Days | Drama
...
So above, the book titles would be Dune, The Day I Died, Razorblade Tears, and How To Marry Keanu Reeves In 90 Days, but as you can see the genres precede as well as succeed the titles.
I was thinking I could create a list of all the genres (as there are only so many) and remove those from the column along with the "|" characters, but if anyone has suggestions on a simpler way to remove the genres and "|" key, please help me out.
CodePudding user response:
It is an enhancement to @tdy Regex solution. The original regex Family|Drama
will match the words "Family" and "Drama" in the string. If the book title contains the words in gernes
, the words will be removed as well.
Supposed that the labels are separated by " | ", there are three match conditions we want to remove.
- Gerne at start of string. e.g.
Drama | ...
- Gerne in the middle. e.g.
... | Drama | ...
- Gerne at end of string. e.g.
... | Drama
Use regex (^|\| )(?:Family|Drama)(?=( \||$))
to match one of three conditions. Note that | Drama | Family
has 2 overlapped matches, here I use ?=( \||$)
to avoid matching once only. See this problem [Use regular expressions to replace overlapping subpatterns] for more details.
>>> genres = ["Family", "Drama"]
>>> df
# Book Labels
# 0 Drama | Drama 123 | Family
# 1 Drama 123 | Drama | Family
# 2 Drama | Family | Drama 123
# 3 123 Drama 123 | Family | Drama
# 4 Drama | Family | 123 Drama
>>> re_str = "(^|\| )(?:{})(?=( \||$))".format("|".join(genres))
>>> df['Book Labels'] = df['Book Labels'].str.replace(re_str, "", regex=True)
# 0 | Drama 123
# 1 Drama 123
# 2 | Drama 123
# 3 123 Drama 123
# 4 | 123 Drama
>>> df["Book Labels"] = df["Book Labels"].str.strip("| ")
# 0 Drama 123
# 1 Drama 123
# 2 Drama 123
# 3 123 Drama 123
# 4 123 Drama
CodePudding user response:
Since your sample shows that the title is not in a consistent location, I'd go with your idea:
create a list of all the genres (as there are only so many) and remove those from the column along with the "|" characters
Use Series.replace
to remove the genres and Series.str.strip
to strip the separators:
genres = ['Science Fiction', 'Drama', 'Thriller', 'Mystery', 'Family', 'Comedy']
df['Book Labels'] = df['Book Labels'].replace('|'.join(genres), '', regex=True).str.strip('| ')
# Book Labels
# 0 Dune
# 1 The Day I Died
# 2 Razorblade Tears
# 3 How To Marry Keanu Reeves In 90 Days
If the title is always in a consistent location, say 3rd in the list, then we would not need the list of genres. We could use Series.str.split
with expand=True
and get the 3rd column (index 2):
# only works if the title is always in position 3 (index 2)
index = 2
df['Book Labels'] = df['Book Labels'].str.split('|', expand=True)[index]