Select and modify a random subset of the dataframe's elements

Time:01-05

I have a dataframe structured like this:

1 8 9  
6 4 9  
5 4 8    

I want to randomly pick 50% of the cells in this dataframe and set them to 1.

For example:

1 8 9 
6 1 1 
1 4 8 

I found DataFrame.sample, but it looks like it can only choose whole rows or columns.

CodePudding user response:

df[np.random.random(df.shape) > .5] = 1

np.random.random(df.shape) creates an array of random floats in [0, 1) with the same shape as df. Comparing it to .5 gives a boolean array in which each cell is True with probability 0.5, so on average half of the cells are selected. This array is then used as a mask to set the selected values to 1.
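
As a minimal sketch of the mask approach on the question's data: the column names A, B, C and the seeded generator are assumptions added so the snippet runs on its own, and DataFrame.mask is used here instead of in-place assignment so the original frame stays untouched.

```python
import numpy as np
import pandas as pd

# Data from the question; column names are assumed for illustration.
df = pd.DataFrame([[1, 8, 9], [6, 4, 9], [5, 4, 8]], columns=["A", "B", "C"])

# Seeded generator (an assumption) so repeated runs give the same mask.
rng = np.random.default_rng(0)

# Each cell is True independently with probability 0.5.
mask = rng.random(df.shape) > 0.5

# Returns a new dataframe with 1 wherever the mask is True.
result = df.mask(mask, 1)

# Number of cells actually changed; this varies with the random draw.
n_changed = int(mask.sum())
print(result)
print("cells selected:", n_changed)
```

The number of selected cells is not fixed: it can be anywhere from 0 to df.size, with half of the cells being only the average.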

CodePudding user response:

Since there was some debate about selecting cells probabilistically(*) versus selecting an exact number of them, here is a solution that randomly selects an exact number of cells to modify.

(*) Probabilistic means that 50% of the cells are selected on average, but a given run may, by chance, select noticeably fewer or more than that.

It uses random.sample to pick a fixed number of positions from a flat index over the array, then numpy.unravel_index to convert them into indices matching the original shape of the data. Finally, the assignment is done by indexing the underlying numpy array (this only works when all columns share a single dtype).

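import random
import numpy as np

N = df.size // 2  # number of cells to modify; here 9 // 2 = 4

# pick N distinct flat positions, then convert them to (rows, cols) index arrays
idx = np.unravel_index(random.sample(range(df.size), N), df.shape)

# writes through to df only if df.values is a writable view of the data
# (single dtype; with pandas Copy-on-Write enabled the array is read-only)
df.values[idx] = -1  # using -1 here for clarity

Example output (exactly 4 cells are modified on every run; which ones varies):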

   A  B  C
0 -1  8  9
1  6 -1 -1
2  5  4 -1
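
To make the flat-to-2D index conversion concrete, here is a small sketch with hand-picked flat positions instead of random ones. The positions [0, 4, 5, 8] are chosen for illustration because they reproduce the example output above; the snippet works on a copy of the data rather than on df.values, which is a design choice of this sketch and not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 8, 9], [6, 4, 9], [5, 4, 8]], columns=["A", "B", "C"])

# Hand-picked flat positions (row-major order) for illustration.
flat = [0, 4, 5, 8]

# Convert flat positions into row and column index arrays for a 3x3 shape.
rows, cols = np.unravel_index(flat, df.shape)
# rows -> [0 1 1 2], cols -> [0 1 2 2]

# Work on an explicit copy so the original dataframe is not modified.
arr = df.to_numpy(copy=True)
arr[rows, cols] = -1

out = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(out)
```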

Handling mixed-type arrays/dataframes

We just need to create an array of booleans and use pandas.DataFrame.mask:

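idx = np.unravel_index(random.sample(range(df.size), N), df.shape)
a = np.zeros(df.shape, dtype=bool)  # boolean mask, initially all False
a[idx] = True  # mark the N selected cells
df2 = df.mask(a, -1)  # new dataframe with the selected cells set to -1
# in-place alternative (here blanking the cells instead): df[a] = np.nan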
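
A self-contained sketch of the mixed-type case follows. The sample dataframe (an integer, a float, and an object column), the seed, and the use of numpy's Generator.choice in place of random.sample are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-dtype dataframe: int, float and object columns.
df = pd.DataFrame({
    "A": [1, 6, 5],
    "B": [8.5, 4.0, 4.5],
    "C": pd.Series(["x", "y", "z"], dtype=object),
})

N = df.size // 2  # exact number of cells to modify (4 of 9)

# Seeded generator (an assumption) for repeatable results.
rng = np.random.default_rng(42)

# N distinct flat positions, drawn without replacement.
flat = rng.choice(df.size, size=N, replace=False)

# Build a boolean mask with exactly N True cells.
mask = np.zeros(df.shape, dtype=bool)
mask[np.unravel_index(flat, df.shape)] = True

# mask() works column by column, so mixed dtypes are not a problem.
out = df.mask(mask, -1)
print(out)
```

Because the mask is built separately from the data, the same boolean array can be reused with DataFrame.where (to keep the selected cells and replace the others) if the opposite selection is needed.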