Python: Getting only one string of interest out of a series of similar strings

I am looking into this dataset: https://www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset?select=rotten_tomatoes_movies.csv

I am interested in scores grouped by production companies, but some companies have subdivisions that are very similar to each other, e.g. 20th Century Fox, 20th Century Fox Distribution, 20th Century Fox Film, 20th Century Fox Film Corp., and so on.

I am looking for a way to collect all the movies produced under any subdivision into one category, in this case 20th Century Fox, as I am not interested in the specific division.

I have done some initialization and cleaning of the data based on a Git repository:

import pandas as pd
import numpy as np

df = pd.read_csv('rotten_tomatoes_movies.csv')

cols_of_interest = ['movie_title', 'genres', 'content_rating', 'original_release_date', 'streaming_release_date', 'runtime', 'tomatometer_rating', 'tomatometer_count', 'audience_rating', 'audience_count', 'production_company']

df = df[cols_of_interest]

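# Assign the result back instead of using inplace=True on a column accessed through the DataFrame
df['original_release_date'] = df['original_release_date'].fillna(df['streaming_release_date'])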
df.drop(columns=['streaming_release_date'], inplace=True)
df.rename(columns={'original_release_date': 'release_date'}, inplace=True)

df = df[(df.tomatometer_count>0) & (df.audience_count>0)]

df.drop_duplicates(subset=['movie_title', 'release_date'], inplace=True)

df.dropna(subset=['genres', 'release_date'], inplace=True)

df = df.sort_values(by='release_date', ascending=True).reset_index(drop=True)

For my specific problem I had the idea to base analysis on the first word using:

df.production_company.astype('|S')
test = df.production_company.split(' ',1)

which gives

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 1: ordinal not in range(128)

Any ideas on other approaches or help on the current Error would be greatly appreciated!

CodePudding user response：

Maybe some production companies are French. According to Wikipedia: "The unicode string for \xe9 is an accented e - é". You can try specifying the encoding explicitly when reading the file:

df = pd.read_csv('rotten_tomatoes_movies.csv', encoding='utf-8')

CodePudding user response：

I guess I've found it. It's still an encoding issue but this time related to the astype() method.

The example below works:

df = pd.DataFrame({'production_company': ["Cinefrance Studios"]})
df.production_company.astype("|S")

The example below does not. It raises an exception complaining about the character 0xe9 (the accented é), whose value 233 lies outside the ASCII range of 0 to 127 (2^7 = 128 values):

df = pd.DataFrame({'production_company': ["Cinéfrance Studios"]})
df.production_company.astype("|S")

It's due to the accented character in the company name. The '|S' dtype (I haven't found much documentation on it) appears to be NumPy's fixed-width byte-string type, and the conversion seems to be restricted to ASCII (American Standard Code for Information Interchange), which uses only the first 7 bits of each byte. Accented characters are therefore not supported; representing them requires at least an 8-bit extended ASCII encoding. The better and more universal approach is to use Unicode.

If your goal is to obtain bytes from the production company names, I can suggest this solution:
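To make the difference between the two codecs concrete, here is a minimal sketch in plain Python, without pandas. The variable names are only illustrative; the company name is the one from the example above.

```python
# The company name from the failing example; "\xe9" is the accented é.
name = "Cin\xe9france Studios"

# ASCII only covers code points 0..127, and é is code point 233 (0xe9).
print(ord("\xe9"))  # 233

try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    # Same kind of error as in the question's traceback.
    ascii_ok = False
    print(exc)

# UTF-8 can represent any Unicode code point; é becomes the two bytes 0xc3 0xa9.
utf8_bytes = name.encode("utf-8")
print(utf8_bytes)  # b'Cin\xc3\xa9france Studios'
```

Decoding `utf8_bytes` with `.decode("utf-8")` gives back the original string, so nothing is lost by going through UTF-8.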

# .str.encode leaves missing values as NaN instead of failing on them
df['production_company'] = df['production_company'].str.encode('utf-8')

Otherwise, maybe there is a way to tell astype() to use the UTF-8 codec instead of the ASCII codec.
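For the original goal of grouping subdivisions together, the conversion to bytes may not be needed at all, since the pandas string methods work directly on text, accents included. Below is a rough sketch under a few assumptions: the small DataFrame is made-up sample data standing in for the real file, and `parents`, `to_parent`, `company_group` and `first_word` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Made-up sample standing in for the real production_company column.
df = pd.DataFrame({
    "production_company": [
        "20th Century Fox",
        "20th Century Fox Film Corp.",
        "20th Century Fox Distribution",
        "Cinéfrance Studios",
        None,
    ],
    "tomatometer_rating": [80.0, 60.0, 70.0, 50.0, 90.0],
})

# Hypothetical list of parent names; extend it with the companies you care about.
parents = ["20th Century Fox"]

def to_parent(name):
    """Return the parent name a company starts with, or the name unchanged."""
    if not isinstance(name, str):
        return name  # leave missing values untouched
    for parent in parents:
        if name.startswith(parent):
            return parent
    return name

# Option 1: fold every subdivision into its parent company.
df["company_group"] = df["production_company"].map(to_parent)

# Option 2: the first-word idea from the question, done with .str methods.
df["first_word"] = df["production_company"].str.split(" ", n=1).str[0]

# Average score per group; rows with a missing company are dropped by groupby.
mean_scores = df.groupby("company_group")["tomatometer_rating"].mean()
print(mean_scores)
```

Grouping on the first word alone is simpler but coarser: unrelated companies that happen to share a first word would end up in the same group, which is why the sketch also shows the explicit prefix list.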