I have a table of A/B test results broken down by segment, like this:
import pandas as pd
import numpy as np
data = [
    ['a', 'segment1', 12, 14], ['a', 'segment1', 12, 14], ['b', 'segment2', 12, 11], ['a', 'segment2', 10, 11],
    ['b', 'segment1', 4, 5], ['b', 'segment1', 32, 15], ['b', 'segment2', 14, 8], ['a', 'segment2', 11, 21],
    ['b', 'segment1', 1, 21], ['b', 'segment1', 4, 21], ['a', 'segment2', 6, 32], ['b', 'segment2', 3, 21],
]
df_data = pd.DataFrame(data, columns=['test_group', 'segment', 'feature1', 'feature2'])
df_data
test_group | segment | feature1 | feature2 |
---|---|---|---|
a | segment1 | 12 | 14 |
a | segment1 | 12 | 14 |
b | segment2 | 12 | 11 |
a | segment2 | 10 | 11 |
... | ... | ... | ... |
I want to run the Mann-Whitney U test
scipy.stats.mannwhitneyu(group_a, group_b, use_continuity=True, alternative='two-sided')
between test groups a and b within each segment. The desired table should look like this:
segment | test_group | feature1 | feature2 |
---|---|---|---|
segment1 | a | | |
segment1 | b | p_value | p_value |
segment2 | a | | |
segment2 | b | p_value | p_value |
CodePudding user response:
EDIT: Here is an arguably cleaner solution: first pivot df_data to be test_group by segment, with lists as the entries. Each column then holds one list for group a and one for group b, so we can apply the Mann-Whitney U test (mwu) to each column:
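import scipy.stats

mwu_df = df_data.pivot_table(
    index='test_group',
    columns='segment',
    values=['feature1', 'feature2'],
    aggfunc=list,  # each cell becomes the list of values for one (test_group, segment) pair
).apply(
    # v is one (feature, segment) column: two lists, group a then group b
    lambda v: scipy.stats.mannwhitneyu(*v, use_continuity=True, alternative='two-sided').pvalue,
    axis=0,
).reset_index(level='segment', name='mwu_p')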
mwu_df
Output:
segment mwu_p
feature1 segment1 0.474549
feature1 segment2 0.662521
feature2 segment1 0.474549
feature2 segment2 0.368688
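To see why unpacking with *v works, it may help to look at the intermediate pivot on its own. The snippet below is only a sketch that rebuilds the question's data; the names lists_df, group_a and group_b are mine and are not part of the solution above.

```python
import pandas as pd

data = [
    ['a', 'segment1', 12, 14], ['a', 'segment1', 12, 14], ['b', 'segment2', 12, 11], ['a', 'segment2', 10, 11],
    ['b', 'segment1', 4, 5], ['b', 'segment1', 32, 15], ['b', 'segment2', 14, 8], ['a', 'segment2', 11, 21],
    ['b', 'segment1', 1, 21], ['b', 'segment1', 4, 21], ['a', 'segment2', 6, 32], ['b', 'segment2', 3, 21],
]
df_data = pd.DataFrame(data, columns=['test_group', 'segment', 'feature1', 'feature2'])

# Rows are test groups (a, b); columns are (feature, segment) pairs;
# every cell is the list of raw values for that combination.
lists_df = df_data.pivot_table(
    index='test_group',
    columns='segment',
    values=['feature1', 'feature2'],
    aggfunc=list,
)
print(lists_df)

# One column is a two-element Series: the list for group a, then the list for group b.
# This is what the lambda receives as v, so *v passes the two samples to mannwhitneyu.
group_a, group_b = lists_df[('feature1', 'segment1')]
print(group_a, group_b)
```

This relies on the index having exactly two test groups; with more groups, *v would pass more than two samples and the call would need to be adapted.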
Previous idea:
The idea is to first melt df_data so that we can group on segment and feature, then apply the mwu test to each group. This is done in a somewhat ugly way with apply and a custom lambda function.
Also note that the output format here is not the one you specified.
import scipy.stats

mwu_df = df_data.melt(
    id_vars=['segment', 'test_group'],
    value_vars=['feature1', 'feature2'],
    var_name='feature',
).groupby(['segment', 'feature']).apply(
    # compare the values of group a against group b within each (segment, feature) group
    lambda g: scipy.stats.mannwhitneyu(
        g.loc[g['test_group'].eq('a'), 'value'],
        g.loc[g['test_group'].eq('b'), 'value'],
        use_continuity=True,
        alternative='two-sided',
    ).pvalue
).reset_index(name='mwu_p')
mwu_df
Output:
segment feature mwu_p
0 segment1 feature1 0.474549
1 segment1 feature2 0.474549
2 segment2 feature1 0.662521
3 segment2 feature2 0.368688
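If you need the exact layout from the question (one row per segment and test group, with the p-values on the b rows), one possible way is to build the rows explicitly. This is a rough sketch rather than a polished solution; the names features, rows and result are my own, and leaving the a rows as NaN is an assumption about what the empty cells should contain. Also, as far as I know, recent SciPy versions pick an exact method by default for small samples without ties, so the tie-free cell (segment2, feature1) may not match the outputs printed above.

```python
import numpy as np
import pandas as pd
from scipy import stats

data = [
    ['a', 'segment1', 12, 14], ['a', 'segment1', 12, 14], ['b', 'segment2', 12, 11], ['a', 'segment2', 10, 11],
    ['b', 'segment1', 4, 5], ['b', 'segment1', 32, 15], ['b', 'segment2', 14, 8], ['a', 'segment2', 11, 21],
    ['b', 'segment1', 1, 21], ['b', 'segment1', 4, 21], ['a', 'segment2', 6, 32], ['b', 'segment2', 3, 21],
]
df_data = pd.DataFrame(data, columns=['test_group', 'segment', 'feature1', 'feature2'])

features = ['feature1', 'feature2']
rows = []
for segment, seg_df in df_data.groupby('segment'):
    a = seg_df.loc[seg_df['test_group'] == 'a', features]
    b = seg_df.loc[seg_df['test_group'] == 'b', features]
    # one p-value per feature, comparing group a against group b in this segment
    pvals = {
        f: stats.mannwhitneyu(a[f], b[f], use_continuity=True, alternative='two-sided').pvalue
        for f in features
    }
    # the a row stays empty (NaN); the b row carries the p-values
    rows.append({'segment': segment, 'test_group': 'a', **{f: np.nan for f in features}})
    rows.append({'segment': segment, 'test_group': 'b', **pvals})

result = pd.DataFrame(rows, columns=['segment', 'test_group'] + features)
print(result)
```

The explicit loop is longer than the one-liners above, but it keeps the row layout under your control and avoids reshaping the result afterwards.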