I have the following dataframe, df, with millions of rows:
id score
140 0.1223
142 0.01123
148 0.1932
166 0.0226
.. ..
My problem is:
How can I study the distribution of each percentile?
My idea was to divide score into percentiles and see what percentage corresponds to each one.
I would like to get something like
percentile countofindex percentage
1 154.000
2 100.320
3 250.000 !
...
where countofindex is the number of distinct ids, and percentage is the share represented by the first, second, ... percentile.
For this, I computed df['percentage'] = df['score'] / df['score'].sum() * 100, but this is the percentage over all the data.
I know it must be a very simple question, but I would be very grateful for any help.
CodePudding user response:
To get the percentage of each score, sum all the scores and divide each one by that total:
import pandas as pd
df = pd.DataFrame({'score': [0.1223, 0.01123, 0.1932]})
df['percentage'] = df['score'] / df['score'].sum() * 100
score percentage
0 0.12230 37.431518
1 0.01123 3.437089
2 0.19320 59.131393
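To sort, you can use .sort_values:
# sort_values returns a new DataFrame, so assign the result back
df = df.sort_values(by=['percentage'], ascending=False)
# number the sorted rows 1..n (this is a rank, used here as the percentile column)
df.insert(1, 'percentile', range(1, len(df) + 1))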
score percentile percentage
2 0.19320 1 59.131393
0 0.12230 2 37.431518
1 0.01123 3 3.437089
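If the goal is one row per percentile bin (with the number of ids and the share of the total score in each bin) rather than one row per score, a possible sketch is below. It assumes the real dataframe has the columns id and score shown in the question; the random data, the name summary, and the column score_sum are made up for illustration, and pd.qcut is one choice of binning, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the real df (columns: id, score).
rng = np.random.default_rng(0)
df = pd.DataFrame({"id": np.arange(1000), "score": rng.random(1000)})

# Assign each row to one of 100 equal-frequency bins (1 = lowest scores).
# Ranking with method="first" breaks ties, so qcut always gets unique bin edges.
df["percentile"] = pd.qcut(df["score"].rank(method="first"), q=100, labels=False) + 1

# Per bin: number of distinct ids and the summed score.
summary = df.groupby("percentile").agg(
    countofindex=("id", "nunique"),
    score_sum=("score", "sum"),
)

# Share of the total score that falls in each percentile bin.
summary["percentage"] = summary["score_sum"] / df["score"].sum() * 100
print(summary.head())
```

Because the bins are equal-frequency, countofindex will be nearly constant when every id appears once; the percentage column is the part that varies with the score distribution.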
CodePudding user response:
Let's go through the following example.
print(df)
0
0 0.127975
1 0.146976
2 0.721326
3 0.003722
df[0].sum()
1.0
Now, to create the chart:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
data = (df[0] * 100).round(2).tolist()  # each value as a percentage, 2 decimals
langs = [str(v) for v in data]  # the same values as bar labels
ax.bar(langs,data)
plt.ylabel("Percentiles")
plt.xlabel("Values")
plt.xticks(rotation=45)
plt.show()
If you want to add it to the dataset, you can use the code below.
df['Percentiles (%)'] = (df[0] * 100).round(2)
print(df)
0 Percentiles (%)
0 0.127975 12.80
1 0.146976 14.70
2 0.721326 72.13
3 0.003722 0.37
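Note that the column above is each value's share of the total. If a percentile rank is wanted instead (where each value sits relative to the others), a minimal sketch with Series.rank could look like this. The column name "Percentile rank (%)" is just an illustrative choice.

```python
import pandas as pd

# Same four values as in the example above.
df = pd.DataFrame({0: [0.127975, 0.146976, 0.721326, 0.003722]})

# rank(pct=True) gives rank / number of rows, i.e. the fraction of
# rows with a value less than or equal to this one.
df["Percentile rank (%)"] = df[0].rank(pct=True) * 100
print(df)
```

Here the largest value gets 100 and the smallest gets 25, since there are four rows.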