Randomly select 50% of records from 3 different groups for A/B test

Apologies if this has been asked already. I am trying to set up a small A/B test and split the records evenly (50/50) within each of 3 categories: Low Intent, Medium Intent, and High Intent. For each category, I'd like to randomly assign 50% of the records to a control group and the other 50% to a treatment group, recording the assignment in a new column.

Sample Data:

|ID |Buyer Intent |Email           |
|:-:|:-----------:|:---------------|
|1  |Low Intent   |[email protected]|
|2  |Medium Intent|[email protected]|
|3  |Medium Intent|[email protected]|
|4  |Low Intent   |[email protected]|
|5  |High Intent  |[email protected]|
|6  |High Intent  |[email protected]|

Desired Data:

|ID|Buyer Intent |Email           |Group     |
|:-|:-----------:|:--------------:|:--------:|
|1 |Low Intent   |[email protected]|Control   |
|2 |Medium Intent|[email protected]|Treatment |
|3 |Medium Intent|[email protected]|Control   |
|4 |Low Intent   |[email protected]|Treatment |
|5 |High Intent  |[email protected]|Treatment |
|6 |High Intent  |[email protected]|Control   |

CodePudding user response：

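Use groupby.sample to choose 50% of the records per group, then assign the labels with np.where:

import numpy as np

# index labels of a random half of each 'Buyer Intent' group
control = df.groupby('Buyer Intent').sample(frac=0.5).index

# sampled rows become Control, all other rows Treatment
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')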

#    ID   Buyer Intent           Email      Group
# 0   1     Low Intent  [email protected]    Control
# 1   2  Medium Intent  [email protected]    Control
# 2   3  Medium Intent   [email protected]  Treatment
# 3   4     Low Intent  [email protected]  Treatment
# 4   5    High Intent  [email protected]    Control
# 5   6    High Intent  [email protected]  Treatment
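
For reference, here is a self-contained sketch of the same approach on the question's sample data. The Email column is left out for brevity, `random_state=0` is an assumption added only to make the split reproducible, and the `crosstab` at the end is just a sanity check that every intent level contributes equally to both groups. It requires pandas 1.1.0 or later for groupby.sample.

```python
import numpy as np
import pandas as pd

# Sample data from the question (Email column omitted for brevity)
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Buyer Intent': ['Low Intent', 'Medium Intent', 'Medium Intent',
                     'Low Intent', 'High Intent', 'High Intent'],
})

# Draw half of each intent level; random_state is an assumed seed for reproducibility
control = df.groupby('Buyer Intent').sample(frac=0.5, random_state=0).index

# Rows whose index label was drawn become Control, the rest Treatment
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')

# Sanity check: count Control and Treatment rows per intent level
counts = pd.crosstab(df['Buyer Intent'], df['Group'])
print(counts)
```

With two rows per intent level, each level should show exactly one Control row and one Treatment row, whichever rows the seed happens to pick.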

Note that groupby.sample already randomizes; its documentation reads:

"Return a random sample of items from each group."

But to shuffle explicitly, you can add DataFrame.sample with frac=1:

# shuffle df
df = df.sample(frac=1)

# same as before
control = df.groupby('Buyer Intent').sample(frac=0.5).index
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')

If you don't have groupby.sample (pandas < 1.1.0):

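Try groupby.apply with DataFrame.sample:

# group_keys=False keeps the original index, so the sampled rows can be matched back to df
control = df.groupby('Buyer Intent', group_keys=False).apply(lambda g: g.sample(frac=0.5)).index
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')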

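Or groupby.apply with np.random.choice:

# replace=False draws distinct rows; explode flattens the per-group arrays into one Series of index labels
control = df.groupby('Buyer Intent').apply(lambda g: np.random.choice(g.index, len(g) // 2, replace=False)).explode()
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')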
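
A sketch of the same split without groupby.sample, written as a plain loop over the groups with a seeded NumPy generator (`np.random.default_rng`, NumPy 1.17 or later). The helper name `assign_groups`, the seed, and the toy data are assumptions for illustration only. The detail that matters is `replace=False`, which prevents the same row from being drawn twice within a group.

```python
import numpy as np
import pandas as pd

def assign_groups(df, by, seed=None):
    """Hypothetical helper: label half of each `by` group Control, the rest Treatment."""
    rng = np.random.default_rng(seed)
    control = []
    for _, g in df.groupby(by):
        # Draw floor(len(g) / 2) distinct index labels from this group
        picked = rng.choice(g.index.to_numpy(), size=len(g) // 2, replace=False)
        control.extend(picked)
    out = df.copy()
    out['Group'] = np.where(out.index.isin(control), 'Control', 'Treatment')
    return out

# Toy data (assumed): two intent levels with four rows each
df = pd.DataFrame({
    'ID': range(1, 9),
    'Buyer Intent': ['Low Intent'] * 4 + ['High Intent'] * 4,
})

result = assign_groups(df, 'Buyer Intent', seed=42)
sizes = result.groupby(['Buyer Intent', 'Group']).size()
print(sizes)
```

Because the sample size is floored, a group with an odd number of rows ends up with one more Treatment row than Control rows.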

CodePudding user response：

To choose 50%, you need something that returns a random sample of items from each group: that is groupby.sample.
Next, you need something that assigns a value depending on a condition: that is np.where.
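
To make the np.where step concrete, here is a minimal sketch with a hand-written boolean mask. The mask values are hypothetical; in the real case the mask comes from checking which rows were sampled.

```python
import numpy as np

# Hypothetical mask: True marks a row that was sampled into the control group
is_control = np.array([True, False, False, True])

# np.where(condition, x, y) picks x where condition is True, otherwise y
labels = np.where(is_control, 'Control', 'Treatment')
print(labels.tolist())  # ['Control', 'Treatment', 'Treatment', 'Control']
```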