I have a data frame with sales by month and the price at each month's observation. Is there a way to take those prices, and break them up into 10 equal groups based on the price itself, so group 1 would be 10% with the lowest prices, and group 10 be 10% of the observations with the highest prices?
I tried pd.cut
but I need to know where those 10% cut points are to bin correctly and I'm not sure how to get that when I have over 30K rows of data. I also need equal number of observations in each group.
Here is some test data:
from scipy.stats import poisson
import pandas as pd
data = poisson.rvs(mu=42.73, size=10000)
data_ = pd.DataFrame(data, columns=['price'])
Thank you.
CodePudding user response:
First let's create a dataframe to use as an example for OP's use case
import pandas as pd
sales = pd.DataFrame({'sales': [100, 200, 300, 400, 500, 600],
'price': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
Depending on one's goals one can use:
sales['group'] = pd.cut(sales['price'], 10, labels=False)
[Out]:
sales price group
0 100 1.0 0
1 200 2.0 1
2 300 3.0 3
3 400 4.0 5
4 500 5.0 7
sales['group'] = pd.qcut(sales['price'], 10, labels=False)
[Out]:
sales price group
0 100 1.0 0
1 200 2.0 1
2 300 3.0 3
3 400 4.0 5
4 500 5.0 7
In this example the output is the same, but that is not necessarily the case. Therefore, for more information on the difference between pandas.cut
and pandas.qcut
, see this thread: What is the difference between pandas.qcut and pandas.cut?
Additionally, if one wants to see the labels for the resulting bins, change labels=False
to labels=True
.
Finally, in order to get the amount of observations in each group, one can use pandas.DataFrame.groupby
. Let's create a new dataframe called group_counts
group_counts = sales.groupby('group').size().reset_index(name='counts')
[Out]:
group counts
0 0 1
1 1 1
2 3 1
3 5 1
4 7 1