The 'ratings' DataFrame has two columns of interest: User-ID and Book-Rating.
I'm trying to make a histogram showing the amount of books read per user in this dataset. In other words, I'm looking to count Book-Ratings per User-ID. I'll include the dataset in case anyone wants to check it out.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
!wget https://raw.githubusercontent.com/porterjenkins/cs180-intro-data-science/master/data/ratings_train.csv
ratings = pd.read_csv('ratings_train.csv')
# Remove Values where Ratings are Zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]
# Sort by User
ratings2 = ratings2.sort_values(by=['User-ID'])
usersList = []
booksRead = []
for i in range(2000):
numBooksRead = ratings2.isin([i]).sum()['User-ID']
if numBooksRead != 0:
usersList.append(i)
booksRead.append(numBooksRead)
new_dict = {'User_ID':usersList,'booksRated':booksRead}
usersBooks = pd.DataFrame(new_dict)
usersBooks
The code works as is, but it took almost 5 minutes to complete. And this is the problem: the dataset has 823,000 values. So if it took me 5 minutes to sort through only the first 2000 numbers, I don't think it's feasible to go through all of the data.
I also should admit, I'm sure there's a better way to make a DataFrame than creating two lists, turning them into a dict, and then making that a DataFrame.
Mostly I just want to know how to go through all this data in a way that won't take all day.
Thanks in advance!!
CodePudding user response:
It seems you want a list of user IDs, with the count how often an ID appears in the dataframe. Use
CodePudding user response:
First filter by column Book-Rating
for remove 0
values and then count values by Series.value_counts
:
ratings = pd.read_csv('ratings_train.csv')
ratings2 = ratings[ratings['Book-Rating'] != 0]
usersBooks = (ratings2['User-ID'].value_counts()
.sort_index()
.rename_axis('User_ID')
.reset_index(name='booksRated'))
print (usersBooks)
User_ID booksRated
0 8 6
1 17 4
2 44 1
3 53 3
4 69 2
... ...
21548 278773 3
21549 278782 2
21550 278843 17
21551 278851 10
21552 278854 4
[21553 rows x 2 columns]