Distribution on percentile from dataframe

Time:12-28

I have the following dataframe, df, with millions of rows:

id      score
140    0.1223
142    0.01123
148    0.1932
166    0.0226
..       ..

My problem is:

How can I study the distribution of each percentile?

My idea was to divide score into percentiles and see what percentage corresponds to each one.

I would like to get something like

percentile  countofindex  percentage
  1          154.000
  2          100.320
  3          250.000
 ...

where countofindex is the number of distinct ids, and percentage is the percentage represented by the first, second, ... percentile.

So far I have df['percentage'] = df['score'] / df['score'].sum() * 100, but this gives each row's percentage of the total over all the data, not a percentage per percentile.

I know it must be a very simple question, but if you could help me I would be very grateful

CodePudding user response:

To get the percentage of each score you can get the sum of all scores and divide each one by it:

import pandas as pd

df = pd.DataFrame({'score': [0.1223, 0.01123, 0.1932]})

df['percentage'] = df['score'] / df['score'].sum() * 100
     score  percentage
0  0.12230   37.431518
1  0.01123    3.437089
2  0.19320   59.131393

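To sort, you can use .sort_values and then number the sorted rows:

# sort_values returns a new DataFrame, so assign the result back
df = df.sort_values(by=['percentage'], ascending=False)
# number the rows 1..n in sorted order
df.insert(1, 'percentile', range(1, len(df) + 1))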
     score  percentile  percentage
2  0.19320           1   59.131393
0  0.12230           2   37.431518
1  0.01123           3    3.437089
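
If the goal is the per-percentile table from the question rather than a rank per row, one possible approach is to bin the scores with pd.qcut and aggregate per bin. This is only a sketch: the sample data below is randomly generated as a stand-in for the real df, and the column names percentile, countofindex, and percentage are taken from the desired output in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real df (ids and scores are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": np.arange(1000),
    "score": rng.random(1000),
})

# Assign each row to one of 100 equal-sized bins by score (1 = lowest scores).
df["percentile"] = pd.qcut(df["score"], q=100, labels=False) + 1

# Per bin: number of distinct ids and the summed score.
summary = df.groupby("percentile").agg(
    countofindex=("id", "nunique"),
    score_sum=("score", "sum"),
)

# Share of the total score that falls in each bin, in percent.
summary["percentage"] = summary["score_sum"] / df["score"].sum() * 100
summary = summary.drop(columns="score_sum").reset_index()

print(summary.head())
```

If the real scores contain many repeated values, pd.qcut can raise an error about duplicate bin edges; passing duplicates="drop" avoids that, at the cost of ending up with fewer than 100 bins.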

CodePudding user response:

Let's go through the following example.

print(df)
           0
0   0.127975
1   0.146976
2   0.721326
3   0.003722

df[0].sum()
1.0

Now, to create the chart:

import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
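# convert each value in column 0 to a percentage rounded to two decimals
data = (df[0] * 100).round(2).tolist()
# use the same percentages, as strings, for the bar labels
langs = [str(v) for v in data]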
ax.bar(langs,data)
plt.ylabel("Percentiles")
plt.xlabel("Values")
plt.xticks(rotation=45)
plt.show()

If you want to add it to the dataset, you can use the code below.

df['Percentiles (%)'] = (df[0] * 100).round(2)

print(df)
          0  Percentiles (%)
0  0.127975            12.80
1  0.146976            14.70
2  0.721326            72.13
3  0.003722             0.37
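
Note that the column above is each value's share of the total, not its percentile position. If a per-row percentile rank is what you need, a minimal sketch using Series.rank with pct=True could look like this. The column name "Percentile rank (%)" is made up for illustration.

```python
import pandas as pd

# Same four values as in the example above, in a column labelled 0.
df = pd.DataFrame({0: [0.127975, 0.146976, 0.721326, 0.003722]})

# rank(pct=True) returns each row's rank divided by the number of rows,
# so the largest value gets 1.0 and the smallest gets 1/len(df).
df["Percentile rank (%)"] = df[0].rank(pct=True) * 100

print(df)
```

Here the smallest value (0.003722) gets 25.0 and the largest (0.721326) gets 100.0, because there are four rows.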