Python. Pandas. Calculate statistical difference with group by

I have a table of A/B test results broken down by segment, like this:

import pandas as pd
import numpy as np
import scipy.stats  # needed for the mannwhitneyu calls below

data = [['a', 'segment1', 12,14], ['a', 'segment1', 12,14], ['b', 'segment2', 12,11],['a', 'segment2', 10,11],
       ['b', 'segment1', 4,5], ['b', 'segment1', 32,15], ['b', 'segment2', 14,8],['a', 'segment2', 11,21],
       ['b', 'segment1', 1,21], ['b', 'segment1', 4,21], ['a', 'segment2', 6,32],['b', 'segment2', 3,21],
       ]
df_data = pd.DataFrame(data, columns = ['test_group', 'segment', 'feature1', 'feature2'])
df_data

test_group	segment	feature1	feature2
a	segment1	12	14
a	segment1	12	14
b	segment2	12	11
a	segment2	10	11
...	...	...	...

For each segment, I want to compare test groups a and b with a Mann-Whitney U test:

scipy.stats.mannwhitneyu(group_a,
                         group_b,
                         use_continuity=True, alternative='two-sided')

The desired table should look like this:

segment	test_group	feature1	feature2
segment1	a
segment1	b	p_value	p_value
segment2	a
segment2	b	p_value	p_value

CodePudding user response:

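EDIT: Here is an arguably cleaner solution. First pivot df_data so that the rows are the test groups and the columns are (feature, segment) pairs, with each cell holding the list of values for that combination. Then apply the Mann-Whitney U test down each column; because the rows are a and b, *v passes the two lists as the two samples.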

mwu_df = df_data.pivot_table(
    index='test_group',
    columns='segment',
    values=['feature1','feature2'],
    aggfunc=list,
).apply(lambda v: 
    scipy.stats.mannwhitneyu(*v,use_continuity=True,alternative='two-sided').pvalue,
    axis=0
).reset_index(level='segment',name='mwu_p')

mwu_df

Output:

           segment     mwu_p
feature1  segment1  0.474549
feature1  segment2  0.662521
feature2  segment1  0.474549
feature2  segment2  0.368688
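
One caveat with the pivot approach: `*v` unpacks the cells by position, so it only works while the pivoted frame has exactly two rows. A minimal sketch that picks the two samples by label instead, assuming the same `df_data` as in the question; the helper name `mwu_pvalue` and the variable names are made up for illustration:

```python
import pandas as pd
import scipy.stats

data = [['a', 'segment1', 12, 14], ['a', 'segment1', 12, 14], ['b', 'segment2', 12, 11], ['a', 'segment2', 10, 11],
        ['b', 'segment1', 4, 5], ['b', 'segment1', 32, 15], ['b', 'segment2', 14, 8], ['a', 'segment2', 11, 21],
        ['b', 'segment1', 1, 21], ['b', 'segment1', 4, 21], ['a', 'segment2', 6, 32], ['b', 'segment2', 3, 21]]
df_data = pd.DataFrame(data, columns=['test_group', 'segment', 'feature1', 'feature2'])

# One row per test group, one column per (feature, segment) pair;
# each cell holds the list of observed values for that combination.
pivoted = df_data.pivot_table(
    index='test_group',
    columns='segment',
    values=['feature1', 'feature2'],
    aggfunc=list,
)

def mwu_pvalue(col):
    # Hypothetical helper: select the samples by label ('a', 'b'), not by position.
    return scipy.stats.mannwhitneyu(
        col.loc['a'], col.loc['b'],
        use_continuity=True, alternative='two-sided',
    ).pvalue

# Series of p-values indexed by (feature, segment)
mwu_by_label = pivoted.apply(mwu_pvalue, axis=0)
print(mwu_by_label)
```

With more than two test groups, this version would still compare only a and b, whereas the positional `*v` call would fail because mannwhitneyu expects exactly two samples.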

Previous idea:

The idea here is to first melt df_data into long format so that it can be grouped by segment and feature, and then apply the Mann-Whitney U test to each group. The per-group step is somewhat ugly: it uses apply with a custom lambda that selects the a and b values by hand.

Also note that the output format is not exactly the one you specified.

mwu_df = df_data.melt(
    id_vars=['segment','test_group'],
    value_vars=['feature1','feature2'],
    var_name='feature',
).groupby(['segment','feature']).apply(
    lambda g:
        scipy.stats.mannwhitneyu(
            g.loc[g['test_group'].eq('a'),'value'], 
            g.loc[g['test_group'].eq('b'),'value'],
            use_continuity=True,
            alternative='two-sided',
        ).pvalue
).reset_index(name='mwu_p')

mwu_df

Output:

    segment   feature     mwu_p
0  segment1  feature1  0.474549
1  segment1  feature2  0.474549
2  segment2  feature1  0.662521
3  segment2  feature2  0.368688
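
To get closer to the layout asked for in the question (one p-value column per feature), the long result can be pivoted back to wide form. A minimal sketch, assuming `mwu_df` has the `segment`, `feature` and `mwu_p` columns shown above; the p-values are copied from that output so the snippet runs on its own, and `wide` is just an illustrative name:

```python
import pandas as pd

# Long-format result; values copied from the output above.
mwu_df = pd.DataFrame({
    'segment': ['segment1', 'segment1', 'segment2', 'segment2'],
    'feature': ['feature1', 'feature2', 'feature1', 'feature2'],
    'mwu_p': [0.474549, 0.474549, 0.662521, 0.368688],
})

# One row per segment, one column of p-values per feature.
wide = mwu_df.pivot(index='segment', columns='feature', values='mwu_p')
wide.columns.name = None   # drop the leftover 'feature' axis label
wide = wide.reset_index()  # turn 'segment' back into a regular column
print(wide)
```

Each p-value describes the a versus b comparison for a whole segment, so one row per segment carries the same information as the paired a/b rows in the requested table.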