Apologies if this has been asked already. I am trying to set up a small A/B test on records that fall into 3 categories: Low Intent, Medium Intent, and High Intent. I'd like to randomly assign 50% of each category to a control group and the other 50% to a treatment group, with the assignment stored in a new column.
Sample Data:
|ID|Buyer Intent |Email |
|:--:|:-----------:|:-------------|
|1 |Low Intent |[email protected]|
|2 |Medium Intent|[email protected]|
|3 |Medium Intent|[email protected] |
|4 |Low Intent |[email protected]|
|5 |High Intent |[email protected]|
|6 |High Intent |[email protected]|
Desired Data:
|ID|Buyer Intent |Email |Group |
|:--|:-----------:|:--------------:|:----------:|
|1 |Low Intent |[email protected] |Control |
|2 |Medium Intent|[email protected] |Treatment |
|3 |Medium Intent|[email protected] |Control |
|4 |Low Intent |[email protected] |Treatment |
|5 |High Intent |[email protected] |Treatment |
|6 |High Intent |[email protected] |Control |
CodePudding user response:
Use groupby.sample to choose 50% of the records per group, then assign the labels with np.where:
control = df.groupby('Buyer Intent').sample(frac=0.5).index
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')
# ID Buyer Intent Email Group
# 0 1 Low Intent [email protected] Control
# 1 2 Medium Intent [email protected] Control
# 2 3 Medium Intent [email protected] Treatment
# 3 4 Low Intent [email protected] Treatment
# 4 5 High Intent [email protected] Control
# 5 6 High Intent [email protected] Treatment
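For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data and checks the split. The email addresses are made-up placeholders, and `random_state=0` is my addition so the run is repeatable; neither comes from the question.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data; the email addresses are made-up placeholders.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Buyer Intent": ["Low Intent", "Medium Intent", "Medium Intent",
                     "Low Intent", "High Intent", "High Intent"],
    "Email": [f"user{i}@example.com" for i in range(1, 7)],
})

# Sample half of each category; random_state only makes the run repeatable.
control = df.groupby("Buyer Intent").sample(frac=0.5, random_state=0).index
df["Group"] = np.where(df.index.isin(control), "Control", "Treatment")

# Rows are categories, columns are groups; each cell counts the rows assigned.
counts = pd.crosstab(df["Buyer Intent"], df["Group"])
print(counts)
```

With two rows per category, every cell of the crosstab should be 1. If a category has an odd number of rows, the two groups cannot be exactly equal in size for that category.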
Note that groupby.sample already randomizes. Its documentation reads: "Return a random sample of items from each group." But to shuffle explicitly, you can add DataFrame.sample with frac=1:
# shuffle df
df = df.sample(frac=1)
# same as before
control = df.groupby('Buyer Intent').sample(frac=0.5).index
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')
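If you don't have groupby.sample (pandas < 1.1.0), try groupby.apply with DataFrame.sample:
# group_keys=False keeps the original row index, so .index holds the sampled row labels
control = df.groupby('Buyer Intent', group_keys=False).apply(lambda g: g.sample(frac=0.5)).index
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')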
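Or groupby.apply with np.random.choice (pass replace=False so that no row is drawn twice):
picks = df.groupby('Buyer Intent').apply(lambda g: np.random.choice(g.index, len(g) // 2, replace=False))
control = np.concatenate(picks.tolist())  # flatten the per-group arrays into one array of row labels
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')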
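Another option for older pandas, which is not one of the approaches above but a swapped-in technique: shuffle once, then label the first half of each category by its position within the shuffled group. This is only a sketch; the placeholder emails and the `random_state=1` seed are my assumptions.

```python
import numpy as np
import pandas as pd

# Sample data with made-up placeholder emails.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Buyer Intent": ["Low Intent", "Medium Intent", "Medium Intent",
                     "Low Intent", "High Intent", "High Intent"],
    "Email": [f"user{i}@example.com" for i in range(1, 7)],
})

# Shuffle all rows once; the seed is only there for repeatability.
shuffled = df.sample(frac=1, random_state=1)

# Position of each row within its category (0, 1, ...) in shuffled order.
pos = shuffled.groupby("Buyer Intent").cumcount()
# Size of the category each row belongs to.
size = shuffled.groupby("Buyer Intent")["ID"].transform("size")

# The first half of each shuffled category becomes Control, the rest Treatment.
shuffled["Group"] = np.where(pos < size // 2, "Control", "Treatment")

# Restore the original row order.
df = shuffled.sort_index()
print(df)
```

Because it relies only on DataFrame.sample, cumcount, and transform, it avoids groupby.sample entirely. With an odd category size, the extra row lands in Treatment.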
CodePudding user response:
To choose 50% of each category, use groupby.sample, which returns a random sample of items from each group.
Then use np.where to assign a label to each row depending on whether it was sampled.
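To see the labeling step in isolation, here is a small sketch. The index and the `picked` labels are hypothetical stand-ins for what the sampling step would return.

```python
import numpy as np
import pandas as pd

idx = pd.Index([0, 1, 2, 3])   # row labels of a hypothetical 4-row frame
picked = [1, 3]                # hypothetical labels chosen by the sampling step

# isin gives True for sampled rows; np.where maps True/False to the two labels.
mask = idx.isin(picked)
labels = np.where(mask, "Control", "Treatment")
print(labels)  # ['Treatment' 'Control' 'Treatment' 'Control']
```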