The dataframe is like this;
Cluster | Genre 1 | Genre 2 | Genre 3 | Genre 4 | Genre 5 |
---|---|---|---|---|---|
1 | 10 | 31 | 5 | 3 | 23 |
2 | 53 | 12 | 6 | 9 | 7 |
3 | 44 | 73 | 1 | 9 | 13 |
As output, I want something like this, so I can see what genres are the dominant ones in each cluster.
Cluster | 1st | 2nd | 3rd |
---|---|---|---|
1 | Genre 2 | Genre 5 | Genre 1 |
2 | Genre 1 | Genre 2 | Genre 4 |
3 | Genre 2 | Genre 1 | Genre 5 |
I want to show the top 3 "genres" from each cluster in a graph, I have no idea how I would do this for a row instead of columns. Is anyone here familiar with this?
CodePudding user response:
You can use numpy.argsort
on df.values
and axis=1
and select three largest and use df.columns for getting column name:
import pandas as pd
import numpy as np
df = df.set_index('Cluster')
res = pd.DataFrame(df.columns[np.argsort(-1*df.values,axis=1)[:, :3]],
columns=['1st', '2nd',' 3rd'])
print(res)
Output:
1st 2nd 3rd
0 Genre 2 Genre 5 Genre 1
1 Genre 1 Genre 2 Genre 4
2 Genre 2 Genre 1 Genre 5
CodePudding user response:
You can use nlargest with apply for each row as given in the code below to solve this problem,
>>> import pandas as pd
>>>
>>> dictA = {'genre1': [10,20,56,2,30,44],'genre2': [ 9,70,96,7,3,54],'genre3': [ 12,10,16,23,39,4],'genre4': [ 40,29,26,15,43,14]}
>>>
... dfA = pd.DataFrame(dictA)
>>> dfA.index.name = 'cluster'
>>>
... dfA.reset_index(inplace=True)
>>>
>>> dfA
cluster genre1 genre2 genre3 genre4
0 0 10 9 12 40
1 1 20 70 10 29
2 2 56 96 16 26
3 3 2 7 23 15
4 4 30 3 39 43
5 5 44 54 4 14
>>> df1 = (dfA.set_index('cluster')
... .apply(lambda x: pd.Series(x.nlargest(3).index), axis=1)
... .reset_index())
>>> df1
cluster 0 1 2
0 0 genre4 genre3 genre1
1 1 genre2 genre4 genre1
2 2 genre2 genre1 genre4
3 3 genre3 genre4 genre2
4 4 genre4 genre3 genre1
5 5 genre2 genre1 genre4
Adapted from here