How do I assign groups to python-pandas data based on integer column?-CodePudding

I have a data frame with sales by month and the price at each month's observation. Is there a way to take those prices, and break them up into 10 equal groups based on the price itself, so group 1 would be 10% with the lowest prices, and group 10 be 10% of the observations with the highest prices?

I tried pd.cut but I need to know where those 10% cut points are to bin correctly and I'm not sure how to get that when I have over 30K rows of data. I also need equal number of observations in each group.

Here is some test data:

from scipy.stats import poisson
import pandas as pd

data = poisson.rvs(mu=42.73, size=10000)
data_ = pd.DataFrame(data, columns=['price'])

Thank you.

CodePudding user response：

First let's create a dataframe to use as an example for OP's use case

import pandas as pd

sales = pd.DataFrame({'sales': [100, 200, 300, 400, 500, 600],
                        'price': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

Depending on one's goals one can use:

• pandas.cut

sales['group'] = pd.cut(sales['price'], 10, labels=False)

[Out]:
   sales  price  group
0    100    1.0      0
1    200    2.0      1
2    300    3.0      3
3    400    4.0      5
4    500    5.0      7

• pandas.qcut

sales['group'] = pd.qcut(sales['price'], 10, labels=False)

[Out]:
   sales  price  group
0    100    1.0      0
1    200    2.0      1
2    300    3.0      3
3    400    4.0      5
4    500    5.0      7

In this example the output is the same, but that is not necessarily the case. Therefore, for more information on the difference between pandas.cut and pandas.qcut, see this thread: What is the difference between pandas.qcut and pandas.cut?

Additionally, if one wants to see the labels for the resulting bins, change labels=False to labels=True.

Finally, in order to get the amount of observations in each group, one can use pandas.DataFrame.groupby. Let's create a new dataframe called group_counts

group_counts = sales.groupby('group').size().reset_index(name='counts')

[Out]:
   group  counts
0      0       1
1      1       1
2      3       1
3      5       1
4      7       1