I have the a data frame like this:
data = {'Result1':[10, 11, 12, 44],
'Result2':[6, 7, 55, 114],
'ID':['A', 'B', 'A', 'B'],
'anoter_id':["one", "one", "two", "two"],
'date' : ['2022-06-06', '2022-06-06', '2022-06-01', '2022-06-01']
}
df = pd.DataFrame(data)
And I want to group by date
and anoter_id
and keep the lines where Result1
and Result2
of id==B
is 10% bigger or lower than Result1and
Result2of
id==A`.
I also want to have a column with this difference, like "Result1_B/Result1_A"
and "Result2_B/Result2_A"
My initial idea was to do this:
df.query("ID == 'A'").sort_values('date').Result1/df.query("ID == 'B'").sort_values('date').Result1
Moreover is super verbose and I have to write on line for each column on my dataset.
Any ideas?
EDIT:
An example of output is below.
On Result1_B/Result1_A_perDate
the results come from 11/10 and 7/6, that corresponds to Result1 of id B divided by Result1 of id A for anoter_id
'one' . The column inside_10%_limit
is TRUE if both columns Result1_B/Result1_A_perDate
and Result2_B/Result2_A_perDate
are less or more than maximum 10%
output = {'Result1_B/Result1_A_perDate':[1.1, 1.16],
'Result2_B/Result2_A_perDate':[3.6, 2.0],
'anoter_id':["one", "two"],
'date' : ['2022-06-06', '2022-06-01'],
'inside_10%_limit': ['True', 'False']
}
pd.DataFrame(output)
CodePudding user response:
The logic is still not fully clear, but here is my assumption:
# pivot DataFrame
df2 = df.pivot(index=['anoter_id', 'date'], columns='ID')
# Use MultiIndex selection to divide B by A
out = df2.xs('B', level=1, axis=1).div(df2.xs('A', level=1, axis=1)).add_suffix('_B/A')
# get top 10% percentile
thresh = out.stack().quantile(0.9, interpolation='nearest')
# 3.6666
# identify rows with both values above or below threshold
out['10%lim'] = out.gt(thresh).all(1) | out.lt(thresh).all(1)
# index as columns
out = out.reset_index()
output:
anoter_id date Result1_B/A Result2_B/A 10%lim
0 one 2022-06-06 1.100000 1.166667 True
1 two 2022-06-01 3.666667 2.072727 False