I am working with a dataset where I am separating the contents of one Excel column into 3 separate columns. A mock version of the data is as follows:
Movie Titles/Category/Rating |
---|
Wolf of Wall Street A-13 x 9 |
Django Unchained IMDB x 8 |
The EXPL Haunted House FEAR x 7 |
Silver Lining DC-23 x 8 |
This is what I want the results to look like:
Title | Category | Rating |
---|---|---|
Wolf of Wall Street | A-13 | 9 |
Django Unchained | IMDB | 8 |
The EXPL Haunted House | FEAR | 7 |
Silver Lining | DC-23 | 8 |
Here is the RegEx I used to successfully separate the cells: For Rating, this RegEx worked:
data = [[Movie Titles/Category/Rating, Rating]] = data['Movie Titles/Category/Rating'].str.split(' x ', expand = True)
However, to separate Category from movie titles, this RegEx doesn't work:
data['Category']=data['Movie Titles/Category/Rating'].str.extract('((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4}$))', expand = True)
Since the uppercase letter pattern is present in the middle of the third cell as well (EXPL and I only want to separate FEAR into a separate column), the regex pattern '\s[A-Z]{4}$' is not working. Is there a way to indicate in the RegEx pattern that I only want the uppercase text in the end of the table cell to separate (FEAR) and not the middle (EXPL)?
CodePudding user response:
Assuming there is always x
between Category and Rating, and the Category has no spaces in it, then the following should get what you want:
(.*) (.*) x (\d )
CodePudding user response:
I think
'((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4})) x'
would work for you - to indicate that you want the part of the string that comes right before the x
. (Assuming that pattern is always true for your data.)
CodePudding user response:
You can use
import pandas as pd
df = pd.DataFrame({'Movie Titles/Category/Rating':['Wolf of Wall Street A-13 x 9','Django Unchained IMDB x 8','The EXPL Haunted House FEAR x 7','Silver Lining DC-23 x 8']})
df2 = df['Movie Titles/Category/Rating'].str.extract(r'^(?P<Movie>.*?)\s (?P<Category>\S )\s x\s (?P<Rating>\d )$', expand=True)
See the regex demo.
Details:
^
- start of string(?P<Movie>.*?)
- Group (Column) "Movie": any zero or more chars other than line break chars, as few as possible\s
- one or more whitespaces(?P<Category>\S )
- Group "Category": one or more non-whitespace chars\s x\s
-x
enclosed with one or more whitespaces(?P<Rating>\d )
- Group "Rating": one or more digits$
- end of string.