Suppose I have the following dataframe:
Type Name
S2019 John
S2019 Stephane
S2019 Mike
S2019 Hamid
S2021 Rahim
S2021 Ahamed
I want to group the dataset by "Type", add a new column named "Sampled", and randomly assign yes/no to each row so that yes and no are distributed equally within each group. The expected dataframe could be:
Type Name Sampled
S2019 John no
S2019 Stephane yes
S2019 Mike yes
S2019 Hamid no
S2021 Rahim yes
S2021 Ahamed no
CodePudding user response:
You can use numpy.random.choice to assign yes/no to every row with equal probability, ignoring the groups:
import numpy as np
df['Sampled'] = np.random.choice(['yes', 'no'], size=len(df))
output:
Type Name Sampled
0 S2019 John no
1 S2019 Stephane no
2 S2019 Mike yes
3 S2019 Hamid no
4 S2021 Rahim no
5 S2021 Ahamed yes
equal probability per group:
df['Sampled'] = (df.groupby('Type')['Type']
                   .transform(lambda g: np.random.choice(['yes', 'no'],
                                                         size=len(g)))
                )
For each group, take an arbitrary column (here Type, but any column works; it is only needed to get a one-dimensional selection) and apply np.random.choice with the length of the group as the size. This yields as many yes/no values as there are rows in the group, each drawn with equal probability (note that you can set a specific probability per item if you want).
NB. Equal probability does not mean you will necessarily get a 50/50 split of yes/no. If that is what you want, please clarify.
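As a rough sketch of the per-item probability mentioned above, np.random.choice accepts a p argument. The 70/30 split below is an assumption chosen only for illustration, and the dataframe is rebuilt from the question so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Type": ["S2019"] * 4 + ["S2021"] * 2,
    "Name": ["John", "Stephane", "Mike", "Hamid", "Rahim", "Ahamed"],
})

# Assumption: a 70/30 split, picked only to illustrate the p argument.
probs = [0.7, 0.3]

df["Sampled"] = (
    df.groupby("Type")["Type"]
      .transform(lambda g: np.random.choice(["yes", "no"], size=len(g), p=probs))
)

# Each row is drawn independently, so the counts change from run to run.
print(df.groupby("Type")["Sampled"].value_counts())
```

Because every row is an independent draw, a group can still end up all yes or all no, which is why this approach does not guarantee a fixed split.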
half yes/no per group
If you want half of each kind (yes/no) (±1 in case of odd size), you can randomly select half of the indices:
idx = df.groupby('Type', group_keys=False).apply(lambda g: g.sample(n=len(g)//2)).index
df['Sampled'] = np.where(df.index.isin(idx), 'yes', 'no')
NB. For a group with an odd number of rows, there will be one more of the second value passed to np.where, here "no".
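To check the half split, including the odd-size case, here is a small self-contained sketch. It uses an explicit loop over the groups instead of groupby.apply (a variant, not the exact code above), and the fifth S2019 row named "Extra" is a hypothetical addition to make that group odd:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # "Extra" is a hypothetical row added so that S2019 has an odd size.
    "Type": ["S2019"] * 5 + ["S2021"] * 2,
    "Name": ["John", "Stephane", "Mike", "Hamid", "Extra", "Rahim", "Ahamed"],
})

picked = []
for _, g in df.groupby("Type"):
    # Draw floor(n / 2) row labels from this group without replacement.
    picked.extend(g.sample(n=len(g) // 2).index)

df["Sampled"] = np.where(df.index.isin(picked), "yes", "no")

# Tabulate yes/no per group to confirm the split.
counts = pd.crosstab(df["Type"], df["Sampled"])
print(counts)
```

With these sizes, S2019 should show 2 yes and 3 no, and S2021 should show 1 of each; only which rows are picked changes between runs.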
distribute equally many elements:
This distributes the elements as equally as the group size allows. For example, with 3 elements and 4 places, there will be two a, one b and one c, in random order. If you want the extra item(s) to be chosen randomly, shuffle the input first.
elem = ['a', 'b', 'c']
df['Sampled'] = (df
   .groupby('Type', group_keys=False)['Type']
   .transform(lambda g: np.random.choice(np.tile(elem, int(np.ceil(len(g)/len(elem))))[:len(g)],
                                         size=len(g), replace=False))
)
output:
Type Name Sampled
0 S2019 John a
1 S2019 Stephane a
2 S2019 Mike b
3 S2019 Hamid c
4 S2021 Rahim a
5 S2021 Ahamed b
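The suggestion to shuffle the input first could look like the sketch below. The helper name spread is hypothetical, and the dataframe is rebuilt from the question so the snippet is self-contained:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Type": ["S2019"] * 4 + ["S2021"] * 2,
    "Name": ["John", "Stephane", "Mike", "Hamid", "Rahim", "Ahamed"],
})

elem = ["a", "b", "c"]

def spread(g):
    # Hypothetical helper: shuffle a copy of the labels so that the
    # over-represented label differs from group to group.
    labels = np.random.permutation(elem)
    # Repeat the labels enough times to cover the group, then trim.
    pool = np.tile(labels, int(np.ceil(len(g) / len(labels))))[:len(g)]
    # Shuffle the pool so labels land on rows in random order.
    return np.random.permutation(pool)

df["Sampled"] = df.groupby("Type")["Type"].transform(spread)
print(df)
```

Without the first shuffle, the label that appears twice in a 4-row group is always the first entry of elem; with it, any of the three labels can be the repeated one.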
CodePudding user response:
Use a custom function in GroupBy.transform that creates a helper array arr with equally distributed yes/no values and then randomizes the order with numpy.random.shuffle:
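import numpy as np

def f(x):
    # Start with an array of 'no', one entry per row in the group
    arr = np.full(len(x), 'no', dtype=object)
    # Set the first half (rounded down) to 'yes'
    arr[:len(x) // 2] = 'yes'
    # Shuffle in place so the 'yes' positions are random
    np.random.shuffle(arr)
    return arr

df['Sampled'] = df.groupby('Type')['Name'].transform(f)
print(df)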
Type Name Sampled
0 S2019 John yes
1 S2019 Stephane no
2 S2019 Mike no
3 S2019 Hamid yes
4 S2021 Rahim no
5 S2021 Ahamed yes
CodePudding user response:
You can assign an equal distribution of yes and no values to each Type group by shuffling the DataFrame with sample, taking a cumcount within each Type group, and assigning yes/no based on whether the cumcount value is even or odd:
df['Sampled'] = (df.sample(frac=1).groupby('Type').cumcount() % 2 == 0).map({ True: 'yes', False: 'no'})
Sample output:
Type Name Sampled
0 S2019 John yes
1 S2019 Stephane yes
2 S2019 Mike no
3 S2019 Hamid no
4 S2021 Rahim yes
5 S2021 Ahamed no
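The one-liner above can be unpacked into its intermediate steps, which may make the index alignment easier to follow. This is only a sketch with hypothetical variable names (shuffled, position, is_even), using the dataframe from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Type": ["S2019"] * 4 + ["S2021"] * 2,
    "Name": ["John", "Stephane", "Mike", "Hamid", "Rahim", "Ahamed"],
})

# Rows in random order; the original index labels are kept.
shuffled = df.sample(frac=1)

# Running position 0, 1, 2, ... of each row within its Type group,
# counted in the shuffled order.
position = shuffled.groupby("Type").cumcount()

# Even positions become 'yes', odd positions become 'no'.
is_even = position % 2 == 0

# Assignment aligns on the index, so each value lands on its original row.
df["Sampled"] = is_even.map({True: "yes", False: "no"})
print(df)
```

Since positions start at 0, a group with an odd number of rows gets one more yes than no with this approach.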