I have a dataframe like as shown below
ID raw_val var_name constant s_value
1 388 Qty 0.36 -0.032
2 120 Qty 0.36 -0.007
3 34 Qty 0.36 0.16
4 45 Qty 0.36 0.31
1 110 F1 0.36 -0.232
2 1000 F1 0.36 -0.17
3 318 F1 0.36 0.26
4 419 F1 0.36 0.31
My objective is to
a) Find the upper and lower limits (of raw_val
) for each value of var_name
for s_value
>=0
b) Find the upper and lower limits (of raw_val
) for each value of var_name
for s_value
<0
I tried the below
df['sign'] = np.where[df['s_value']<0, 'neg', 'pos']
s = df.groupby(['var_name','sign'])['raw_val'].series
df['buckets'] = pd.IntervalIndex.from_arrays(s)
Please note that my real data is big data and has more than 200 unique values for var_name
column. The distribution of positive and negative values (s_value
) may be uneven for each value of the var_name
columns. In sample df, I have shown even distribution of pos and neg values but it may not be the case in real life.
I expect my output to be like as below
var_name sign low_limit upp_limit
Qty neg 120 388
F1 neg 110 1000
Qty pos 34 45
Qty pos 318 419
CodePudding user response:
I think numpy.where
with aggregate minimal and maximal values is way:
df['sign'] = np.where(df['s_value']<0, 'neg', 'pos')
df1 = (df.groupby(['var_name','sign'], sort=False, as_index=False)
.agg(low_limit=('raw_val','min'), upp_limit=('raw_val','max')))
print (df1)
var_name sign low_limit upp_limit
0 Qty neg 120 388
1 Qty pos 34 45
2 F1 neg 110 1000
3 F1 pos 318 419