I am trying to do this. So, currently my df look like this.
col_names = ['movie_id', 'movie_title', 'genres']
df = pd.read_csv('/content/drive/MyDrive/testing/ml-1m/movies.csv', sep='::', names=col_names, encoding='latin-1')
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller
while the columns are such "movie_id, name, and genre".
I want it to look like this.
The columns would be movie_id, name,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1::Toy Story (1995)::0::0::1::1::1::0::0::0::0::0::0::0::0::0::0::0::0::0::0
2::Jumanji (1995)::0::1::0::1::0::0::0::0::1::0::0::0::0::0::0::0::0::0::0
.
.
...
Basically, I want to turn genre column to multiple columns and 1 will be there if it match the columns.
Is there anyway to do this with pandas?
CodePudding user response:
IIUC you can do it like that:
res = (df
.join(df['genre'].str.get_dummies())
.drop('genre',axis=1))
print(res)
print(res)
movie_id name Action Adventure Animation Children's Comedy Crime Drama Fantasy Romance Thriller
0 1 Toy Story (1995) 0 0 1 1 1 0 0 0 0 0
1 2 Jumanji (1995) 0 1 0 1 0 0 0 1 0 0
2 3 Grumpier Old Men (1995) 0 0 0 0 1 0 0 0 1 0
3 4 Waiting to Exhale (1995) 0 0 0 0 1 0 1 0 0 0
4 5 Father of the Bride Part II (1995) 0 0 0 0 1 0 0 0 0 0
5 6 Heat (1995) 1 0 0 0 0 1 0 0 0 1
6 7 Sabrina (1995) 0 0 0 0 1 0 0 0 1 0
7 8 Tom and Huck (1995) 0 1 0 1 0 0 0 0 0 0
8 9 Sudden Death (1995) 1 0 0 0 0 0 0 0 0 0
9 10 GoldenEye (1995) 1 1 0 0 0 0 0 0 0 1
If you don't want to delete the original genre
column, just remove the last part with drop