Why is Pandas DataFrame Function 'isin()' taking so much time?-CodePudding

The 'ratings' DataFrame has two columns of interest: User-ID and Book-Rating.

I'm trying to make a histogram showing the amount of books read per user in this dataset. In other words, I'm looking to count Book-Ratings per User-ID. I'll include the dataset in case anyone wants to check it out.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

!wget https://raw.githubusercontent.com/porterjenkins/cs180-intro-data-science/master/data/ratings_train.csv
ratings = pd.read_csv('ratings_train.csv')

# Remove Values where Ratings are Zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]
# Sort by User
ratings2 = ratings2.sort_values(by=['User-ID'])
usersList = []
booksRead = []

for i in range(2000):
  numBooksRead = ratings2.isin([i]).sum()['User-ID']
  if numBooksRead != 0:
    usersList.append(i)
    booksRead.append(numBooksRead)

new_dict = {'User_ID':usersList,'booksRated':booksRead}
usersBooks = pd.DataFrame(new_dict)

usersBooks

The code works as is, but it took almost 5 minutes to complete. And this is the problem: the dataset has 823,000 values. So if it took me 5 minutes to sort through only the first 2000 numbers, I don't think it's feasible to go through all of the data.

I also should admit, I'm sure there's a better way to make a DataFrame than creating two lists, turning them into a dict, and then making that a DataFrame.

Mostly I just want to know how to go through all this data in a way that won't take all day.

Thanks in advance!!

CodePudding user response：

It seems you want a list of user IDs, with the count how often an ID appears in the dataframe. Use

CodePudding user response：

First filter by column Book-Rating for remove 0 values and then count values by Series.value_counts:

ratings = pd.read_csv('ratings_train.csv')

ratings2 = ratings[ratings['Book-Rating'] != 0]

usersBooks = (ratings2['User-ID'].value_counts()
                                 .sort_index()
                                 .rename_axis('User_ID')
                                 .reset_index(name='booksRated'))


print (usersBooks)
       User_ID  booksRated
0            8           6
1           17           4
2           44           1
3           53           3
4           69           2
       ...         ...
21548   278773           3
21549   278782           2
21550   278843          17
21551   278851          10
21552   278854           4

[21553 rows x 2 columns]