What Python RegEx can I use to indicate a pattern only in the end of an Excel cell-CodePudding

I am working with a dataset where I am separating the contents of one Excel column into 3 separate columns. A mock version of the data is as follows:

Movie Titles/Category/Rating
Wolf of Wall Street A-13 x 9
Django Unchained IMDB x 8
The EXPL Haunted House FEAR x 7
Silver Lining DC-23 x 8

This is what I want the results to look like:

Title	Category	Rating
Wolf of Wall Street	A-13	9
Django Unchained	IMDB	8
The EXPL Haunted House	FEAR	7
Silver Lining	DC-23	8

Here is the RegEx I used to successfully separate the cells: For Rating, this RegEx worked:

data = [[Movie Titles/Category/Rating, Rating]] = data['Movie Titles/Category/Rating'].str.split(' x ', expand = True)

However, to separate Category from movie titles, this RegEx doesn't work:

data['Category']=data['Movie Titles/Category/Rating'].str.extract('((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4}$))', expand = True)

Since the uppercase letter pattern is present in the middle of the third cell as well (EXPL and I only want to separate FEAR into a separate column), the regex pattern '\s[A-Z]{4}$' is not working. Is there a way to indicate in the RegEx pattern that I only want the uppercase text in the end of the table cell to separate (FEAR) and not the middle (EXPL)?

CodePudding user response：

Assuming there is always x between Category and Rating, and the Category has no spaces in it, then the following should get what you want:

(.*) (.*) x (\d )

CodePudding user response：

I think

'((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4})) x'

would work for you - to indicate that you want the part of the string that comes right before the x. (Assuming that pattern is always true for your data.)

CodePudding user response：

You can use

import pandas as pd

df = pd.DataFrame({'Movie Titles/Category/Rating':['Wolf of Wall Street A-13 x 9','Django Unchained IMDB x 8','The EXPL Haunted House FEAR x 7','Silver Lining DC-23 x 8']})

df2 = df['Movie Titles/Category/Rating'].str.extract(r'^(?P<Movie>.*?)\s (?P<Category>\S )\s x\s (?P<Rating>\d )$', expand=True)

See the regex demo.

Details:

^ - start of string
(?P<Movie>.*?) - Group (Column) "Movie": any zero or more chars other than line break chars, as few as possible
\s - one or more whitespaces
(?P<Category>\S ) - Group "Category": one or more non-whitespace chars
\s x\s - x enclosed with one or more whitespaces
(?P<Rating>\d ) - Group "Rating": one or more digits
$ - end of string.