I am trying to select a subset of the data that is restricted to a fixed number of unique user ids.
Let's say I want about 200,000 rows out of 10M. I want only 1,500 unique user ids with around 200,000 rows in total (the row count does not need to be exact; being off by a few thousand is okay). Each user has multiple ratings.
Here's the dataset link.
This is how I load the data:
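import pandas as pd
names = ['user_id', 'movie_id', 'rating', 'timestamp']
# a multi-character separator such as '::' needs the python parser engine
df = pd.read_csv('ratings.csv', sep='::', names=names, engine='python')
print(df)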
Is there any way to group it like that with pandas?
CodePudding user response:
I haven't tested this on the real dataset, but the logic should be something like:
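# select 1500 unique users
import numpy as np
users = np.random.choice(df['user_id'].unique(), size=1500, replace=False)
# keep only those users' rows, then sample up to 200k of them at random
df_users = df[df['user_id'].isin(users)]
# capping n avoids the ValueError that sample() raises when n exceeds the row count
df_sample = df_users.sample(n=min(200000, len(df_users)))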
Documentation: numpy.random.choice and pandas.DataFrame.sample
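For a quick sanity check without the real file, here is a minimal sketch of the same idea wrapped in a function and run on synthetic data. The function name sample_users, its parameters, and the made-up demo frame are my own assumptions for illustration, not part of the dataset or of pandas.

```python
import numpy as np
import pandas as pd

def sample_users(df, n_users=1500, max_rows=200_000, seed=0):
    """Pick n_users random user ids, then keep at most max_rows of their rows."""
    rng = np.random.default_rng(seed)
    unique_users = df['user_id'].unique()
    # Cannot draw more users than exist when sampling without replacement.
    n_users = min(n_users, len(unique_users))
    users = rng.choice(unique_users, size=n_users, replace=False)
    subset = df[df['user_id'].isin(users)]
    # Cap n at the number of available rows so sample() does not raise.
    return subset.sample(n=min(max_rows, len(subset)), random_state=seed)

# Synthetic stand-in for ratings.csv (hypothetical data, not the real dataset).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'user_id': rng.integers(1, 5001, size=100_000),
    'movie_id': rng.integers(1, 1001, size=100_000),
    'rating': rng.integers(1, 6, size=100_000),
})

result = sample_users(demo, n_users=1500, max_rows=20_000)
print(len(result), result['user_id'].nunique())
```

Note that the final sample() works on rows, so a user with only a few ratings can drop out of the result entirely. If every selected user must be kept, skip the row cap and accept whatever row count the 1,500 users happen to have.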