Overwrite dataframe values with an exact number of random NAs per column

Time:02-28

I'm using this code to insert NAs at random positions in a dataframe. Here's an example:

set.seed(1)
df <- mtcars[1:10,]
df <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.7, 0.3), size = length(cc), replace = TRUE) ]))

> df
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1  21.0   6    NA 110   NA 2.620    NA  0  1    4    4
2  21.0   6 160.0 110 3.90    NA 17.02 NA NA    4    4
3  22.8   4 108.0  93   NA 2.320 18.61  1  1    4    1
4    NA   6 258.0 110 3.08 3.215 19.44  1  0   NA   NA
5  18.7  NA 360.0  NA 3.15 3.440 17.02  0 NA   NA    2
6    NA   6 225.0 105   NA 3.460 20.22 NA  0   NA    1
7    NA  NA 360.0  NA 3.21 3.570 15.84 NA NA    3    4
8  24.4  NA 146.7  62 3.69 3.190    NA  1  0    4    2
9  22.8   4    NA  NA   NA 3.150 22.90 NA  0   NA   NA
10 19.2  NA 167.6 123 3.92 3.440    NA NA  0    4    4

This works, but the number of NAs varies from column to column. I would like an exact number of NAs in each column. Is there a way to create exactly 3 random NAs per column? Many thanks.

CodePudding user response:

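We can sample 3 row positions per column with row_number() and replace the values at those positions with NA, which gives exactly 3 NAs in each column. Judging by the row names in the output, df here is the original mtcars[1:10,], before any NAs were added.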

library(dplyr)
df1 <- df %>%
   mutate(across(everything(),
     ~ replace(.x, sample(row_number(), 3), NA)))

Output:

df1
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0  NA 160.0  NA 3.90    NA    NA  0  1   NA    4
Mazda RX4 Wag     21.0  NA    NA 110 3.90 2.875 17.02  0 NA    4    4
Datsun 710        22.8   4    NA  NA 3.85 2.320 18.61  1  1   NA    1
Hornet 4 Drive      NA   6 258.0 110 3.08 3.215 19.44  1 NA   NA    1
Hornet Sportabout 18.7  NA 360.0  NA   NA 3.440    NA NA  0    3    2
Valiant           18.1   6 225.0 105   NA 3.460 20.22  1  0    3   NA
Duster 360          NA   8    NA 245 3.21    NA 15.84  0  0    3   NA
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            NA   4 140.8  95   NA 3.150 22.90 NA NA    4    2
Merc 280          19.2   6 167.6 123 3.92    NA    NA NA  0    4   NA
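
One way to confirm the result is to count the missing values in each column. This is a minimal sketch, assuming the input is a fresh copy of mtcars[1:10, ] with no missing values; the name df_clean is only illustrative and does not appear in the original post.

```r
library(dplyr)

set.seed(1)
df_clean <- mtcars[1:10, ]   # illustrative name: a fresh copy without NAs

# Same approach as above: 3 distinct row positions per column become NA
df1 <- df_clean %>%
  mutate(across(everything(),
    ~ replace(.x, sample(row_number(), 3), NA)))

# Count the NAs in each column; every entry should be 3
na_counts <- colSums(is.na(df1))
print(na_counts)
```

Because sample() draws without replacement by default, the 3 positions in a column are always distinct, so a column can never end up with fewer than 3 NAs.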

In base R, the same step can be done by looping over the columns with lapply (the \(x) shorthand for function(x) requires R 4.1 or later):

df[] <- lapply(df, \(x) replace(x, sample(seq_along(x), 3), NA))
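
The base R one-liner could be wrapped in a small function so that the number of NAs becomes a parameter. This is only a sketch: add_exact_nas and its argument n are hypothetical names, not part of any package, and it assumes the input has no missing values to begin with.

```r
# Hypothetical helper: set exactly n randomly chosen values to NA in each column
add_exact_nas <- function(data, n = 3) {
  stopifnot(n >= 0, n <= nrow(data))
  # data[] <- ... keeps the data.frame structure, including row names
  data[] <- lapply(data, function(x) {
    idx <- sample.int(length(x), n)  # n distinct positions, drawn without replacement
    x[idx] <- NA
    x
  })
  data
}

set.seed(1)
df <- mtcars[1:10, ]                 # fresh copy without NAs
df_na <- add_exact_nas(df, n = 3)
print(colSums(is.na(df_na)))         # every column should report 3
```

The count is exact only when the input contains no NAs beforehand. Applied to a dataframe that already has missing values, such as the df produced in the question, a column can end up with more than n NAs, because the sampled positions need not coincide with the existing ones.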
Tags: r