I am trying to find out how to neatly count the number of values in a dataframe column that fall into multiple custom ranges. I realize binning is the common technique for this, but I am not sure that is what I am looking for here.
Let's say I have the following dataframe:
Date Value1 Value2
---------------------------------------
0 4/15/2018 4 4
1 4/16/2018 4 3
2 4/17/2018 2 7
3 4/18/2018 4 3
4 4/19/2018 3 8
5 4/20/2018 9 4
6 4/21/2018 9 6
7 4/22/2018 8 3
8 4/23/2018 9 5
9 4/24/2018 9 5
10 4/25/2018 10 5
11 4/26/2018 15 2
12 4/27/2018 18 2
13 4/28/2018 17 3
14 4/29/2018 6 4
...
Now, I have three custom ranges (all inclusive): 1 - 4, 7 - 10, and 13 - 20.
I want to find out how many values (or rows) from "Value1" fall within each of these ranges. While I know binning can be used like bins = [0,4,7,13,20]
, this does not accomplish what I am looking for, since this tries to place/divide every value in the column into a bin. I just want to find the number of values (rows) that fall into my ranges. This means that that I do not care about the values of 5, 6, 11, 12, 21, 22, 23, etc. To understand the results, I would need to assign labels to my ranges so that 1 - 4 is labelled as "Range 1", 7 - 10 is labelled as "Range 2", and 13 - 20 is labelled as "Range 3".
And so I am trying to produce the following output dataframe:
Date Value1 Value2 Range
---------------------------------------------------
0 4/15/2018 4 4 Range 1
1 4/16/2018 4 3 Range 1
2 4/17/2018 2 7 Range 1
3 4/18/2018 4 3 Range 1
4 4/19/2018 3 8 Range 1
5 4/20/2018 9 4 Range 2
6 4/21/2018 9 6 Range 2
7 4/22/2018 8 3 Range 2
8 4/23/2018 9 5 Range 2
9 4/24/2018 9 5 Range 2
10 4/25/2018 10 5 Range 2
11 4/26/2018 15 2 Range 3
12 4/27/2018 18 2 Range 3
13 4/28/2018 17 3 Range 3
14 4/29/2018 6 4 No_range
so that I can produce the following summary dataframe:
Range Frequency
----------------------------
0 Range 1 5
1 Range 2 6
2 Range 3 3
How could this be accomplished in python?
CodePudding user response:
- Specify the ranges using
Series.between
(which includes both endpoints by default) - Generate the
Range
column usingnp.select
ranges = {
'Range 1': df['Value1'].between(1, 4),
'Range 2': df['Value1'].between(7, 10),
'Range 3': df['Value1'].between(13, 20),
}
df['Range'] = np.select(ranges.values(), ranges.keys(), default='No range')
# Date Value1 Value2 Range
# 0 4/15/2018 4 4 Range 1
# 1 4/16/2018 4 3 Range 1
# 2 4/17/2018 2 7 Range 1
# 3 4/18/2018 4 3 Range 1
# 4 4/19/2018 3 8 Range 1
# 5 4/20/2018 9 4 Range 2
# 6 4/21/2018 9 6 Range 2
# 7 4/22/2018 8 3 Range 2
# 8 4/23/2018 9 5 Range 2
# 9 4/24/2018 9 5 Range 2
# 10 4/25/2018 10 5 Range 2
# 11 4/26/2018 15 2 Range 3
# 12 4/27/2018 18 2 Range 3
# 13 4/28/2018 17 3 Range 3
# 14 4/29/2018 6 4 No range
CodePudding user response:
Use:
bins = pd.IntervalIndex.from_tuples([(1, 4), (7, 10), (13, 20)], closed="both")
lookup = {val: label for val, label in zip(bins.values, [f"Range {i}" for i in range(1, 4)])}
res = pd.cut(df["Value1"], bins=bins).cat.rename_categories(lookup).astype("string").fillna("No Range")
print(res)
Output
0 Range 1
1 Range 1
2 Range 1
3 Range 1
4 Range 1
5 Range 2
6 Range 2
7 Range 2
8 Range 2
9 Range 2
10 Range 2
11 Range 3
12 Range 3
13 Range 3
14 No Range
Name: Value1, dtype: string