Home > OS >  What Python RegEx can I use to indicate a pattern only in the end of an Excel cell
What Python RegEx can I use to indicate a pattern only in the end of an Excel cell

Time:11-18

I am working with a dataset where I am separating the contents of one Excel column into 3 separate columns. A mock version of the data is as follows:

Movie Titles/Category/Rating
Wolf of Wall Street A-13 x 9
Django Unchained IMDB x 8
The EXPL Haunted House FEAR x 7
Silver Lining DC-23 x 8

This is what I want the results to look like:

Title Category Rating
Wolf of Wall Street A-13 9
Django Unchained IMDB 8
The EXPL Haunted House FEAR 7
Silver Lining DC-23 8

Here is the RegEx I used to successfully separate the cells: For Rating, this RegEx worked:

data = [[Movie Titles/Category/Rating, Rating]] = data['Movie Titles/Category/Rating'].str.split(' x ', expand = True)

However, to separate Category from movie titles, this RegEx doesn't work:

data['Category']=data['Movie Titles/Category/Rating'].str.extract('((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4}$))', expand = True)

Since the uppercase letter pattern is present in the middle of the third cell as well (EXPL and I only want to separate FEAR into a separate column), the regex pattern '\s[A-Z]{4}$' is not working. Is there a way to indicate in the RegEx pattern that I only want the uppercase text in the end of the table cell to separate (FEAR) and not the middle (EXPL)?

CodePudding user response:

Assuming there is always x between Category and Rating, and the Category has no spaces in it, then the following should get what you want:

(.*) (.*) x (\d )

CodePudding user response:

I think

'((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4})) x'

would work for you - to indicate that you want the part of the string that comes right before the x. (Assuming that pattern is always true for your data.)

CodePudding user response:

You can use

import pandas as pd

df = pd.DataFrame({'Movie Titles/Category/Rating':['Wolf of Wall Street A-13 x 9','Django Unchained IMDB x 8','The EXPL Haunted House FEAR x 7','Silver Lining DC-23 x 8']})

df2 = df['Movie Titles/Category/Rating'].str.extract(r'^(?P<Movie>.*?)\s (?P<Category>\S )\s x\s (?P<Rating>\d )$', expand=True)

See the regex demo.

Details:

  • ^ - start of string
  • (?P<Movie>.*?) - Group (Column) "Movie": any zero or more chars other than line break chars, as few as possible
  • \s - one or more whitespaces
  • (?P<Category>\S ) - Group "Category": one or more non-whitespace chars
  • \s x\s - x enclosed with one or more whitespaces
  • (?P<Rating>\d ) - Group "Rating": one or more digits
  • $ - end of string.
  • Related