I have a csv formated like:
movie_id | ratings |
---|---|
1 | [{'customer_id': 1, "star": "5"}, {'customer_id': 3, "star": "4"}, {'customer_id': 2, "star": "3"}] |
2 | [{'customer_id': 1, "star": "4"}, {'customer_id': 2, "star": "3"}, {'customer_id': 3, "star": "5"}] |
which i want to transform to:
customer_id_1 | customer_id_2 | customer_id_3 | |
---|---|---|---|
movie_id_1 | 5 | 4 | 3 |
movie_id_2 | 4 | 3 | 5 |
How it can be done by using python pandas ?
CodePudding user response:
Explode the dataframe on ratings and apply series on the dictionary column. An elongated row format is better for manipulation than the one you have specified.
df = df.explode('ratings')
df[['customer_id', 'star']] = df.ratings.apply(pd.Series)
Output
movie_id ratings customer_id star
0 1 {'customer_id': 1, 'star': '5'} 1 5
0 1 {'customer_id': 3, 'star': '4'} 3 4
0 1 {'customer_id': 2, 'star': '3'} 2 3
1 2 {'customer_id': 1, 'star': '4'} 1 4
1 2 {'customer_id': 2, 'star': '3'} 2 3
1 2 {'customer_id': 3, 'star': '5'} 3 5
If you still want your format, follow the code below
df.drop(columns=['ratings']).set_index(['movie_id', 'customer_id']).unstack(['customer_id'])
Output (Multi-index Dataframe)
star
customer_id 1 2 3
movie_id
1 5 3 4
2 4 3 5
If there's missing star value for a customer and/or a movie. Using fillna
will help.
df.drop(columns=['ratings']).set_index(['movie_id', 'customer_id']).unstack(['customer_id']).fillna(0)