I have a one-hot encoded data frame that is similar to the one below:
review_score | Action & Adventure | Classic Movies | TV Comedies | TV Mysteries |
---|---|---|---|---|
4 | 0 | 0 | 1 | 0 |
10 | 1 | 0 | 0 | 0 |
5 | 1 | 0 | 0 | 0 |
7 | 0 | 1 | 0 | 0 |
8 | 0 | 0 | 1 | 0 |
7 | 0 | 0 | 0 | 1 |
it contains 40 columns and I do not think it would be feasible to put every column name into the functions.
How can I pivot or rearrange the table so it looks something like the one below:
Genre | review_score |
---|---|
Action & Adventure | 10 |
Action & Adventure | 5 |
Classic Movies | 7 |
TV Comedies | 4 |
TV Comedies | 8 |
TV Mysteries | 7 |
I want each genre to have a row for each review score given in order to plot a boxplot
I have tried melt and wide_to_long functions but cannot get the desired data frame. Please advice!
CodePudding user response:
In pandas 1.5.0 there is pd.from_dummies
import pandas as pd
genre = pd.from_dummies(df.drop(columns="review_score"))
pd.concat([genre[""].rename("Genre"), df["review_score"]], axis=1)
alternatively, use pd.DataFrame.idxmax
genre = df.drop(columns="review_score").idxmax(axis=1)
pd.concat([genre.rename("Genre"), df["review_score"]], axis=1)
CodePudding user response:
If you don't have 1.5.0 version of pandas, you could do:
cols = ['review_score','Genre']
s = df.set_index('review_score').stack()
out = s[s==1].rename_axis(cols).reset_index()[cols]
print(out)
review_score Genre
0 4 TV Comedies
1 10 Action & Adventure
2 5 Action & Adventure
3 7 Classic Movies
4 8 TV Comedies
5 7 TV Mysteries
CodePudding user response:
You can use iloc
and idxmax(axis=1)
to find column_name
then use pandas.Series.to_frame()
and pandas.DataFrame.assign
to add new column to old dataframe.
new_df = df['review_score'].to_frame().assign(Genre = df.iloc[:, 1:].idxmax(axis=1))
print(new_df)
Output:
review_score Genre
0 4 TV Comedies
1 10 Action & Adventure
2 5 Action & Adventure
3 7 Classic Movies
4 8 TV Comedies
5 7 TV Mysteries
CodePudding user response:
hello i hope that help you, i work it with for loop
l=[]
k=[]
for i in df.columns[1:]:
for j in range(len(df)):
if df[i][j]==1:
l.append(df.review_score[j])
k.append(i)
result= pd.DataFrame()
result['Genre']=k
result['review_score']=l
Out put :
Genre review_score
0 Action_Adventure 10
1 Action_Adventure 5
2 ClassicMovies 7
3 TVComedies 4
4 TVComedies 8
5 TVMysteries 7