I have a Metric and I am trying to create new columns containing labels that describe the quantile each Metric value falls into. However, I am finding that, for the same Metric value, I receive multiple quantile labels when I expect them to only receive one.
I have a metric that looks like this:
Metric |
---|
5.0 |
5.0 |
6.0 |
5.0 |
NaN |
5.0 |
5.0 |
2.0 |
6.0 |
3.0 |
NaN |
2.0 |
5.0 |
5.0 |
5.0 |
2.0 |
I want to create new columns that contain the quantile labels for a range of quantiles. This is what I am doing:
var_metric = ['Metric']
# Quantile range I am interested in creating labels for
quantiles_list = [2, 3, 4, 5]
# Quantile labels
quantile_2_labels = ['Quantile_2_1','Quantile_2_2']
quantile_3_labels = ['Quantile_3_1','Quantile_3_2','Quantile_3_3']
quantile_4_labels = ['Quantile_4_1','Quantile_4_2','Quantile_4_3','Quantile_4_4']
quantile_5_labels = ['Quantile_5_1','Quantile_5_2','Quantile_5_3','Quantile_5_4','Quantile_5_5']
quantiles_labels_list = [quantile_2_labels, quantile_3_labels, quantile_4_labels, quantile_5_labels]
# Loops for creating label columns
for metric in var_metric:
for quantile, labels in zip(quantiles_list, quantiles_labels_list):
df[f'{metric}_Quantile_{quantile}'] = pd.qcut(df[metric].astype('Int64').rank(method='first'), q=quantile, labels=labels, retbins=False)
The result looks like this:
Metric | Metric_Quantile_2 | Metrtic_Quantile_3 | Metric_Quantile_4 | Metric_Quantile_5 |
---|---|---|---|---|
2.0 | Quantile_2_1 | Quantile_3_1 | Quantile_4_1 | Quantile_5_1 |
5.0 | Quantile_2_1 | Quantile_3_2 | Quantile_4_2 | Quantile_5_2 |
5.0 | Quantile_2_1 | Quantile_3_2 | Quantile_4_2 | Quantile_5_3 |
6.0 | Quantile_2_2 | Quantile_3_3 | Quantile_4_4 | Quantile_5_5 |
5.0 | Quantile_2_1 | Quantile_3_2 | Quantile_4_2 | Quantile_5_3 |
NaN | NaN | NaN | NaN | NaN |
5.0 | Quantile_2_2 | Quantile_3_2 | Quantile_4_3 | Quantile_5_3 |
5.0 | Quantile_2_2 | Quantile_3_2 | Quantile_4_3 | Quantile_5_4 |
2.0 | Quantile_2_1 | Quantile_3_1 | Quantile_4_1 | Quantile_5_1 |
6.0 | Quantile_2_2 | Quantile_3_3 | Quantile_4_4 | Quantile_5_5 |
3.0 | Quantile_2_1 | Quantile_3_1 | Quantile_4_2 | Quantile_5_2 |
NaN | NaN | NaN | NaN | NaN |
2.0 | Quantile_2_1 | Quantile_3_1 | Quantile_4_1 | Quantile_5_1 |
5.0 | Quantile_2_2 | Quantile_3_3 | Quantile_4_3 | Quantile_5_4 |
5.0 | Quantile_2_2 | Quantile_3_3 | Quantile_4_4 | Quantile_5_4 |
5.0 | Quantile_2_2 | Quantile_3_3 | Quantile_4_4 | Quantile_5_5 |
2.0 | Quantile_2_1 | Quantile_3_1 | Quantile_4_1 | Quantile_5_2 |
Using column Metric_Quantile_2
as an example, I expect all cases where Metric = 5.0
to receive the same label, however you can see in the second column that the labels for the 5.0
value shift between Quantile_2_1
and Quantile_2_2
. Any idea why?
Thanks!
CodePudding user response:
You can't expect your code produce the right result because you don't apply qcut
on df['Metric']
but on df['Metric'].rank(...)
which is not the same thing:
>>> pd.concat([df[metric], df[metric].rank(method='first')], keys=['Metric', 'Rank'], axis=1)
Metric Rank
0 5.0 5.0
1 5.0 6.0
2 6.0 13.0
3 5.0 7.0
4 NaN NaN
5 5.0 8.0
6 5.0 9.0
7 2.0 1.0
8 6.0 14.0
9 3.0 4.0
10 NaN NaN
11 2.0 2.0
12 5.0 10.0
13 5.0 11.0
14 5.0 12.0
15 2.0 3.0
What works:
>>> pd.concat([df, pd.qcut(df[metric], q=quantile, labels=labels, retbins=False)], axis=1)
Metric Metric
0 5.0 Quantile_2_1 # OK
1 5.0 Quantile_2_1 # OK
2 6.0 Quantile_2_2
3 5.0 Quantile_2_1 # OK
4 NaN NaN
5 5.0 Quantile_2_1 # OK
6 5.0 Quantile_2_1 # OK
7 2.0 Quantile_2_1
8 6.0 Quantile_2_2
9 3.0 Quantile_2_1
10 NaN NaN
11 2.0 Quantile_2_1
12 5.0 Quantile_2_1 # OK
13 5.0 Quantile_2_1 # OK
14 5.0 Quantile_2_1 # OK
15 2.0 Quantile_2_1
What doesn't work:
>>> pd.concat([df, pd.qcut(df[metric].rank(method='first'), q=quantile, labels=labels, retbins=False)], axis=1)
Metric Metric
0 5.0 Quantile_2_1 # OK but lucky, rank=5, first quantile
1 5.0 Quantile_2_1 # OK but lucky, rank=6, first quantile
2 6.0 Quantile_2_2
3 5.0 Quantile_2_1 # OK but lucky, rank=7, first quantile
4 NaN NaN
5 5.0 Quantile_2_2 # KO, rank=8, second quantile
6 5.0 Quantile_2_2 # KO, rank=9, second quantile
7 2.0 Quantile_2_1
8 6.0 Quantile_2_2
9 3.0 Quantile_2_1
10 NaN NaN
11 2.0 Quantile_2_1
12 5.0 Quantile_2_2 # KO, rank=10, second quantile
13 5.0 Quantile_2_2 # KO, rank=11, second quantile
14 5.0 Quantile_2_2 # KO, rank=12, second quantile
15 2.0 Quantile_2_1
CodePudding user response:
The issue is with pd.qcut
which tries to put an equal number of observations into each quantile bin. However, if you have many occurrences of the same value which forces some bins to contain more observations than others, it does not know how to deal with this.
This post breaks down the options that you have in this case: How to qcut with non unique bin edges?
In the end, I decided to switch to pd.cut
which takes the min and max of your metric's value range, and then splits the range up into equal size bins, and then puts your observations into these bins. As a result, you end up with a different number of observations in each bin.
This was my forloop in the end:
for metric in var_metric:
for quantile, labels in zip(quantiles_list, quantiles_labels_list):
df[f'{metric}_Quantile_{quantile}'] = pd.cut(df[metric].astype('Int64'), bins=quantile, labels=labels, retbins=False)
I hope that is clear. Thanks everyone for your help!