Home > Enterprise >  How to get varying string splits into columns python pandas?
How to get varying string splits into columns python pandas?

Time:09-30

I am trying to make different columns from separated strings. My datasource is the https://grouplens.org/datasets/movielens/ ml-latest-small.zip (size: 1 MB)

movie_df = pd.read_csv('movies.csv')
movie_df.head(10)

Reading in the file, I have raw dataframe

I tried to do

movies_df = pd.read_csv('movies.csv', sep='|', encoding='latin-1',
names=['movie_id', 'movie_title','unknown', 'action','adventure', 'animation', 'childrens', 'comedy', 'crime', 'documentary', 'drama', 'fantasy','film_noir', 'horror', 'musical', 'mystery', 'romance', 'sci_fi', 'thriller', 'war', 'western'])
movies_df.head(10)

but this squishes everything before the separator to the first column and the first split on my genre also goes to the first column. Otherwise, it is what I need. See here.

How do I get all my genres of varying lengths to become a unique column after movieId and title? I want each genre to be a column with NaNs if it is not that column to set up for creating dummy variables later.

Edit: I did movies_df.head(10).to_dict() and the output was:

 'title': {0: 'Toy Story (1995)',
  1: 'Jumanji (1995)',
  2: 'Grumpier Old Men (1995)',
  3: 'Waiting to Exhale (1995)',
  4: 'Father of the Bride Part II (1995)',
  5: 'Heat (1995)',
  6: 'Sabrina (1995)',
  7: 'Tom and Huck (1995)',
  8: 'Sudden Death (1995)',
  9: 'GoldenEye (1995)'},
 'genres': {0: 'Adventure|Animation|Children|Comedy|Fantasy',
  1: 'Adventure|Children|Fantasy',
  2: 'Comedy|Romance',
  3: 'Comedy|Drama|Romance',
  4: 'Comedy',
  5: 'Action|Crime|Thriller',
  6: 'Comedy|Romance',
  7: 'Adventure|Children',
  8: 'Action',
  9: 'Action|Adventure|Thriller'}}

CodePudding user response:

The following seems to work. That said, ideally it should be changed to avoid iterating through the genres column in order to get the list of genres, since looping through columns is slow.

movie_df = pd.read_csv('movies.csv')
genre_set = set()
for lst in movie_df['genres'].str.split('|'):
    genre_set.update(lst)
for g in genre_set:
    movie_df[g] = np.nan
    movie_df.loc[movie_df['genres'].str.contains(g),g] = g

The first for loop can be replaced with the single line genre_set.update(*movie_df['genres'].str.split('|')); I don't believe this changes performance.

The resulting frame movie_df looks like this:

                                title  \
0                    Toy Story (1995)   
1                      Jumanji (1995)   
2             Grumpier Old Men (1995)   
3            Waiting to Exhale (1995)   
4  Father of the Bride Part II (1995)   
5                         Heat (1995)   
6                      Sabrina (1995)   
7                 Tom and Huck (1995)   
8                 Sudden Death (1995)   
9                    GoldenEye (1995)   

                                        genres  Action  Romance  Thriller  \
0  Adventure|Animation|Children|Comedy|Fantasy     NaN      NaN       NaN   
1                   Adventure|Children|Fantasy     NaN      NaN       NaN   
2                               Comedy|Romance     NaN  Romance       NaN   
3                         Comedy|Drama|Romance     NaN  Romance       NaN   
4                                       Comedy     NaN      NaN       NaN   
5                        Action|Crime|Thriller  Action      NaN  Thriller   
6                               Comedy|Romance     NaN  Romance       NaN   
7                           Adventure|Children     NaN      NaN       NaN   
8                                       Action  Action      NaN       NaN   
9                    Action|Adventure|Thriller  Action      NaN  Thriller   

   Adventure  Crime  Children  Comedy  Drama  Animation  Fantasy  
0  Adventure    NaN  Children  Comedy    NaN  Animation  Fantasy  
1  Adventure    NaN  Children     NaN    NaN        NaN  Fantasy  
2        NaN    NaN       NaN  Comedy    NaN        NaN      NaN  
3        NaN    NaN       NaN  Comedy  Drama        NaN      NaN  
4        NaN    NaN       NaN  Comedy    NaN        NaN      NaN  
5        NaN  Crime       NaN     NaN    NaN        NaN      NaN  
6        NaN    NaN       NaN  Comedy    NaN        NaN      NaN  
7  Adventure    NaN  Children     NaN    NaN        NaN      NaN  
8        NaN    NaN       NaN     NaN    NaN        NaN      NaN  
9  Adventure    NaN       NaN     NaN    NaN        NaN      NaN  
  • Related