I have one CSV file that contains the columns Movie_ID, and Movie_Name and another that contains the columns User_ID, Movie_ID, and Rating. How can I make a new dataframe where the columns are Movie_Name or Movie_ID and the rows are User_ID and Rating? For instance something like
User_ID | Movie1 | Movie2 | Movie3 |
---|---|---|---|
0 | 1 | 4 | 5 |
1 | 4 | 3 | 5 |
NOTE: not every user has rated every movie.
Thank you in advance.
CodePudding user response:
For this operation I think a join would do the job. Something like this:
df1.join(df2, on='Movie_ID')
If you provide an example a can make the code more specific.
CodePudding user response:
You could use Pandas pivot
to reshape the data based on column values. Since not every user has rated every movie, the result possibly contain NaN
s (which are cast into float
type by Pandas). If you wish to convert the result to int
you can append .fillna(0).astype(int)
at the end of the pivot
function.
Sample df
User_ID Movie_ID Rating
0 131 38552 4
1 203 38552 4
2 50 38552 1
3 585 95894 1
4 138 94611 4
5 632 73111 0
6 328 95894 0
7 461 73111 0
8 924 73111 3
...
...
Pivot function
df_new = df.pivot(index='User_ID', columns='Movie_ID', values='Rating').fillna(0).astype(int)
print(df_new)
Movie_ID 193846 38552 73111 83641 93843 94611 95894
User_ID
7 0 0 2 0 0 0 0
43 0 0 0 0 0 0 0
50 0 1 0 0 0 0 0
85 0 0 0 0 0 0 0
131 0 4 0 0 0 0 0
138 0 0 0 0 0 4 0
145 0 0 4 0 0 0 0
156 0 0 4 0 0 0 0
195 0 0 3 0 0 0 0
203 0 4 0 0 0 0 0
209 0 0 0 0 0 0 0