Given this pandas DataFrame df:
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
samples_indices = df.sample(frac=0.5, replace=False).index
df.loc[samples_indices] = 'X'
This code assigns 'X' to all columns in randomly selected rows making up 50% of df, like so:
X X X X
1 1 1 1
X X X X
1 1 1 1
But how do I assign 'X' to 50% randomly selected cells in the df?
For example like this:
X X X 1
1 X 1 1
X X X 1
1 1 1 X
CodePudding user response:
Use NumPy and boolean indexing for an efficient solution:
import numpy as np
df[np.random.choice([True, False], size=df.shape)] = 'X'
# with a custom probability:
N = 0.5
df[np.random.choice([True, False], size=df.shape, p=[N, 1-N])] = 'X'
Example output:
0 1 2 3
0 X 1 X X
1 X X 1 X
2 X X X 1
3 X X 1 X
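A self-contained sketch of the probability-based approach, assuming a 4x4 frame of ones like the one in the question. The seeded generator and the use of DataFrame.mask (which returns a new frame instead of assigning in place) are choices made for this illustration, not part of the answer above:

```python
import numpy as np
import pandas as pd

# Assumed sample data: a 4x4 frame of ones, as in the question.
df = pd.DataFrame(np.ones((4, 4), dtype=int))

# Seeded generator so the run is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)

p = 0.5  # probability that any single cell is replaced
mask = rng.random(df.shape) < p  # independent True/False draw per cell

# DataFrame.mask replaces values where the condition is True and
# returns a new frame, leaving df untouched.
out = df.mask(mask, 'X')
print(out)
print("cells replaced:", int(mask.sum()), "of", df.size)
```

Because each cell is drawn independently, the number of replaced cells varies from run to run and is only close to p * df.size on average.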
If you need an exact proportion, you can use:
frac = 0.5
# cells whose shuffled rank is below df.size*frac are replaced, so frac is the share set to 'X'
df[np.random.permutation(df.size).reshape(df.shape) < df.size*frac] = 'X'
Example:
0 1 2 3
0 X 1 X 1
1 X 1 X 1
2 1 1 X 1
3 X X 1 X
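The exact-proportion idea can also be sketched as a small helper. The function name mask_exact_fraction, its parameters, and the rounding rule are hypothetical choices for this illustration. It picks k distinct flat positions without replacement instead of thresholding a permutation:

```python
import numpy as np
import pandas as pd

def mask_exact_fraction(df, frac, value='X', seed=None):
    """Return a copy of df with round(frac * df.size) random cells set to value."""
    rng = np.random.default_rng(seed)
    k = int(round(frac * df.size))  # exact number of cells to replace
    flat = np.zeros(df.size, dtype=bool)
    # Choose k distinct flat positions, then mark them True.
    flat[rng.choice(df.size, size=k, replace=False)] = True
    return df.mask(flat.reshape(df.shape), value)

# Assumed sample data: a 4x4 frame of ones, as in the question.
df = pd.DataFrame(np.ones((4, 4), dtype=int))
out = mask_exact_fraction(df, 0.25, seed=1)
print(out)
print((out.astype(object) == 'X').to_numpy().sum())  # 4 cells, i.e. 25% of 16
```

Which cells are chosen depends on the seed, but the count is fixed by frac and the frame size.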
CodePudding user response:
@mozway's answer sets cells to 'X' with a given probability. If you instead want exactly 50% of the cells marked as 'X', you can do this:
import numpy as np
# df.size - df.size // 2 zeros keep the mask the right length when df.size is odd
df[np.random.permutation(np.hstack([np.ones(df.size // 2), np.zeros(df.size - df.size // 2)])).astype(bool).reshape(df.shape)] = 'X'
Example output:
X X X 1
1 X 1 1
X X X 1
1 1 1 X
CodePudding user response:
Create a MultiIndex Series with DataFrame.stack, then use Series.sample to keep a random share of the cells, and finally fill the removed values with 'X' in Series.unstack:
N = 0.5
df = (df.stack().sample(frac=1-N).unstack(fill_value='X')
        .reindex(index=df.index, columns=df.columns, fill_value='X'))
print(df)
0 1 2 3
0 X X 1 1
1 X 1 X 1
2 1 X X X
3 1 1 1 X
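A variant of the stack/sample idea, sketched under the assumption of a 4x4 frame of ones. Instead of dropping cells and refilling them during unstack, it samples the labels to overwrite and assigns 'X' directly on the stacked Series. The random_state value and the cast to object dtype are choices made for this illustration:

```python
import numpy as np
import pandas as pd

# Assumed sample data: a 4x4 frame of ones, as in the question.
df = pd.DataFrame(np.ones((4, 4), dtype=int))

N = 0.5  # share of cells to replace

# One entry per cell, indexed by (row, column); object dtype so that
# strings and integers can live in the same Series.
s = df.stack().astype(object)

# Sample the (row, column) labels to overwrite; random_state is arbitrary.
chosen = s.sample(frac=N, random_state=0).index
s.loc[chosen] = 'X'

# Back to the wide layout, restoring the original row and column order.
out = s.unstack().reindex(index=df.index, columns=df.columns)
print(out)
```

Series.sample draws without replacement by default, so with 16 cells and N = 0.5 exactly 8 cells are replaced.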