Home > Net >  How to count column values in ranges rather than bins?
How to count column values in ranges rather than bins?

Time:11-25

I am trying to find out how to neatly count the number of values in a dataframe column that fall into multiple custom ranges. I realize binning is the common technique for this, but I am not sure that is what I am looking for here.

Let's say I have the following dataframe:

            Date     Value1     Value2
---------------------------------------
 0     4/15/2018          4          4 
 1     4/16/2018          4          3
 2     4/17/2018          2          7 
 3     4/18/2018          4          3
 4     4/19/2018          3          8 
 5     4/20/2018          9          4 
 6     4/21/2018          9          6
 7     4/22/2018          8          3
 8     4/23/2018          9          5
 9     4/24/2018          9          5 
10     4/25/2018         10          5
11     4/26/2018         15          2
12     4/27/2018         18          2
13     4/28/2018         17          3
14     4/29/2018          6          4 
...

Now, I have three custom ranges (all inclusive): 1 - 4, 7 - 10, and 13 - 20.

I want to find out how many values (or rows) from "Value1" fall within each of these ranges. While I know binning can be used like bins = [0,4,7,13,20], this does not accomplish what I am looking for, since this tries to place/divide every value in the column into a bin. I just want to find the number of values (rows) that fall into my ranges. This means that that I do not care about the values of 5, 6, 11, 12, 21, 22, 23, etc. To understand the results, I would need to assign labels to my ranges so that 1 - 4 is labelled as "Range 1", 7 - 10 is labelled as "Range 2", and 13 - 20 is labelled as "Range 3".

And so I am trying to produce the following output dataframe:

            Date     Value1     Value2       Range
---------------------------------------------------
 0     4/15/2018          4          4      Range 1
 1     4/16/2018          4          3      Range 1
 2     4/17/2018          2          7      Range 1 
 3     4/18/2018          4          3      Range 1
 4     4/19/2018          3          8      Range 1 
 5     4/20/2018          9          4      Range 2 
 6     4/21/2018          9          6      Range 2
 7     4/22/2018          8          3      Range 2
 8     4/23/2018          9          5      Range 2
 9     4/24/2018          9          5      Range 2 
10     4/25/2018         10          5      Range 2
11     4/26/2018         15          2      Range 3
12     4/27/2018         18          2      Range 3
13     4/28/2018         17          3      Range 3
14     4/29/2018          6          4     No_range 

so that I can produce the following summary dataframe:

         Range    Frequency   
----------------------------
 0     Range 1            5
 1     Range 2            6  
 2     Range 3            3

How could this be accomplished in python?

CodePudding user response:

  1. Specify the ranges using Series.between (which includes both endpoints by default)
  2. Generate the Range column using np.select
ranges = {
    'Range 1': df['Value1'].between(1, 4),
    'Range 2': df['Value1'].between(7, 10),
    'Range 3': df['Value1'].between(13, 20),
}
df['Range'] = np.select(ranges.values(), ranges.keys(), default='No range')

#          Date  Value1  Value2     Range
# 0   4/15/2018       4       4   Range 1
# 1   4/16/2018       4       3   Range 1
# 2   4/17/2018       2       7   Range 1
# 3   4/18/2018       4       3   Range 1
# 4   4/19/2018       3       8   Range 1
# 5   4/20/2018       9       4   Range 2
# 6   4/21/2018       9       6   Range 2
# 7   4/22/2018       8       3   Range 2
# 8   4/23/2018       9       5   Range 2
# 9   4/24/2018       9       5   Range 2
# 10  4/25/2018      10       5   Range 2
# 11  4/26/2018      15       2   Range 3
# 12  4/27/2018      18       2   Range 3
# 13  4/28/2018      17       3   Range 3
# 14  4/29/2018       6       4  No range

CodePudding user response:

Use:

bins = pd.IntervalIndex.from_tuples([(1, 4), (7, 10), (13, 20)], closed="both")
lookup = {val: label for val, label in zip(bins.values, [f"Range {i}" for i in range(1, 4)])}
res = pd.cut(df["Value1"], bins=bins).cat.rename_categories(lookup).astype("string").fillna("No Range")
print(res)

Output

0      Range 1
1      Range 1
2      Range 1
3      Range 1
4      Range 1
5      Range 2
6      Range 2
7      Range 2
8      Range 2
9      Range 2
10     Range 2
11     Range 3
12     Range 3
13     Range 3
14    No Range
Name: Value1, dtype: string
  • Related