Randomly select cells in df pandas

Given this pandas df:

1   1   1   1
1   1   1   1
1   1   1   1
1   1   1   1

samples_indices = df.sample(frac=0.5, replace=False).index
df.loc[samples_indices] = 'X'

This assigns 'X' to every column of randomly selected rows making up 50% of df, like so:

X   X   X   X
1   1   1   1
X   X   X   X
1   1   1   1

But how do I assign 'X' to a random 50% of the individual cells in the df instead?
For example, like this:

X   X   X   1
1   X   1   1
X   X   X   1
1   1   1   X

CodePudding user response：

For an efficient solution, use NumPy with boolean indexing:

import numpy as np

df[np.random.choice([True, False], size=df.shape)] = 'X'

# with a custom probability:
N = 0.5
df[np.random.choice([True, False], size=df.shape, p=[N, 1-N])] = 'X'

Example output:

   0  1  2  3
0  X  1  X  X
1  X  X  1  X
2  X  X  X  1
3  X  X  1  X
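
Note that each cell is drawn independently here, so the share of 'X' cells only averages N; a single run can land well away from it (the example output above has 12 of 16 cells marked). A minimal sketch to check this, assuming a 4x4 frame of ones as in the question and NumPy's Generator API with an arbitrary seed:

```python
import numpy as np
import pandas as pd

# Assumed example frame: 4x4 of ones, as in the question.
df = pd.DataFrame(np.ones((4, 4), dtype=int))

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
N = 0.5

# Each cell is True with probability N, independently of the others.
mask = rng.random(df.shape) < N

# Cast to object first so the string 'X' can sit next to the integers;
# mask() returns a new frame and leaves df untouched.
out = df.astype(object).mask(mask, 'X')

n_marked = int(mask.sum())
print(out)
print(f"{n_marked} of {df.size} cells marked")
```

The printed count changes with the seed, which is the point: only the expected proportion is N.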

If you need an exact proportion, you can use:

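frac = 0.5  # share of cells to mark
# the permutation holds each of 0..df.size-1 once, so a fixed number of cells falls below the threshold
df[np.random.permutation(df.size).reshape(df.shape) < df.size*frac] = 'X'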

Example:

   0  1  2  3
0  X  1  X  1
1  X  1  X  1
2  1  1  X  1
3  X  X  1  X
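
The same idea can be wrapped in a small helper that works for any fraction. This is only a sketch: the name mark_random_cells and its parameters are hypothetical, and the number of marked cells is assumed to be round(frac * df.size).

```python
import numpy as np
import pandas as pd

def mark_random_cells(df, frac, value='X', seed=None):
    """Return a copy of df with round(frac * df.size) random cells set to value."""
    rng = np.random.default_rng(seed)
    k = int(round(frac * df.size))
    # Pick k distinct flat positions, then build a boolean mask of df's shape.
    flat = np.zeros(df.size, dtype=bool)
    flat[rng.choice(df.size, size=k, replace=False)] = True
    mask = flat.reshape(df.shape)
    # Cast to object so a string value can be mixed with numeric cells.
    return df.astype(object).mask(mask, value)

# Assumed example frame: 4x4 of ones, as in the question.
df = pd.DataFrame(np.ones((4, 4), dtype=int))
out = mark_random_cells(df, frac=0.25, seed=1)
n_x = int((out == 'X').to_numpy().sum())
print(out)
print(n_x)  # 4 cells, i.e. 25% of 16
```

Sampling positions without replacement fixes the count exactly, and returning a new frame keeps the original df intact.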

CodePudding user response：

@mozway's answer lets you set cells to 'X' with a certain probability. If instead you want exactly 50% of the cells marked as 'X', you can do it like this:

import numpy as np

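half = df.size // 2
# half ones plus the remaining zeros, shuffled; df.size - half keeps the length right when df.size is odd
mask = np.random.permutation(np.hstack([np.ones(half), np.zeros(df.size - half)])).astype(bool).reshape(df.shape)
df[mask] = 'X'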

Example output:

X   X   X   1
1   X   1   1
X   X   X   1
1   1   1   X

CodePudding user response：

Create a MultiIndex Series with DataFrame.stack, keep a random 1-N share of its values with Series.sample, then rebuild the DataFrame with Series.unstack, filling the removed cells with 'X':

N = 0.5
# reindex restores any row or column whose cells were all dropped by sample
df = (df.stack().sample(frac=1-N).unstack(fill_value='X')
         .reindex(index=df.index, columns=df.columns, fill_value='X'))
print(df)
   0  1  2  3
0  X  X  1  1
1  X  1  X  1
2  1  X  X  X
3  1  1  1  X
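
A variation on the stack approach, as a sketch rather than a drop-in replacement: instead of dropping the unsampled cells and filling the gaps, sample the cells to mark and overwrite them in the stacked Series. The 4x4 frame and the random_state value are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((4, 4), dtype=int))  # assumed example frame
N = 0.5

# One value per (row, column) pair; object dtype so 'X' can be stored.
stacked = df.astype(object).stack()

# Labels of the cells to mark; random_state is set only for reproducibility.
picked = stacked.sample(frac=N, random_state=0).index

# Overwrite the sampled cells, then restore the 2-D layout and label order.
stacked[stacked.index.isin(picked)] = 'X'
out = stacked.unstack().reindex(index=df.index, columns=df.columns)

print(out)
```

Because no cell is ever removed from the stacked Series, unstack needs no fill value; the final reindex only guards against the row or column order changing.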