I have a dataframe as follow:
key1 key2 val
0 a x 8
1 a x 6
2 a x 7
3 a x 4
4 a x 9
5 a x 1
6 a x 2
7 a x 3
8 a x 10
9 a x 5
10 a y 4
11 a y 9
12 a y 1
13 a y 2
14 b x 17
15 b x 15
16 b x 18
17 b x 19
18 b x 12
19 b x 20
20 b x 14
21 b x 13
22 b x 16
23 b x 11
24 b y 2
25 b y 3
26 b y 10
27 b y 5
28 b y 4
29 b y 24
30 b y 22
What I need to do is:
- Access each group by
key1
- In each group of
key1
, I need to doqcut
on observations thatkey2 == x
- For those observation that is out of bin range, assign them to lowest and highest bins
According to the dataframe above, first group key1 = a
is from indx=0-13. However, only the indx from 0-9 are used to create bins(threshold). The bins(threshold) is then applied from indx=0-13
Then for second group key1 = b
is from indx=14-30. Only indx from 14-23 are used to creates bins(threshold). The bins(threshold) is then applied from indx=14-30.
However, from indx=24-28 and indx=29-30, they are out of bins range. Then for indx=24-28 assign to smallest bin range, indx=29-30 assign to the largest bin range.
The output looks like this:
key1 key2 val labels
0 a x 8 1
1 a x 6 1
2 a x 7 1
3 a x 4 0
4 a x 9 1
5 a x 1 0
6 a x 2 0
7 a x 3 0
8 a x 10 1
9 a x 5 0
10 a y 4 0
11 a y 9 1
12 a y 1 0
13 a y 2 0
14 b x 17 1
15 b x 15 0
16 b x 18 1
17 b x 19 1
18 b x 12 0
19 b x 20 1
20 b x 14 0
21 b x 13 0
22 b x 16 1
23 b x 11 0
24 b y 2 0
25 b y 3 0
26 b y 10 0
27 b y 5 0
28 b y 4 0
29 b y 24 1
30 b y 22 1
My solution: I creates a dict
to contain bins as: (for simplicity, take qcut=2)
dict_bins = {}
key_unique = data['key1'].unique()
for k in key_unique:
sub = data[(data['key1'] == k) & (data['key2'] == 'x')].copy()
dict_bins[k] = pd.qcut(sub['val'], 2, labels=False, retbins=True )[1]
Then, I intend to use groupby
with apply
, but get stuck on accessing dict_bins
data['sort_key1'] = data.groupby(['key1'])['val'].apply(lambda g: --- stuck---)
Any other solution, or modification to my solution is appreciated.
Thank you
CodePudding user response:
A first approach is to create a custom function:
def discretize(df):
bins = pd.qcut(df.loc[df['key2'] == 'x', 'val'], 2, labels=False, retbins=True)[1]
bins = [-np.inf] bins[1:-1].tolist() [np.inf]
return pd.cut(df['val'], bins, labels=False)
df['label'] = df.groupby('key1').apply(discretize).droplevel(0)
Output:
>>> df
key1 key2 val label
0 a x 8 1
1 a x 6 1
2 a x 7 1
3 a x 4 0
4 a x 9 1
5 a x 1 0
6 a x 2 0
7 a x 3 0
8 a x 10 1
9 a x 5 0
10 a y 4 0
11 a y 9 1
12 a y 1 0
13 a y 2 0
14 b x 17 1
15 b x 15 0
16 b x 18 1
17 b x 19 1
18 b x 12 0
19 b x 20 1
20 b x 14 0
21 b x 13 0
22 b x 16 1
23 b x 11 0
24 b y 2 0
25 b y 3 0
26 b y 10 0
27 b y 5 0
28 b y 4 0
29 b y 24 1
30 b y 22 1
You need to drop the first level of index to align indexes:
>>> df.groupby('key1').apply(discretize)
key1 # <- you have to drop this index level
a 0 1
1 1
2 1
3 0
4 1
5 0
6 0
7 0
8 1
9 0
10 0
11 1
12 0
13 0
b 14 1
15 0
16 1
17 1
18 0
19 1
20 0
21 0
22 1
23 0
24 0
25 0
26 0
27 0
28 0
29 1
30 1
Name: val, dtype: int64