Delete columns with over 5% of the values being "NULL"-CodePudding

Within my dataset, I'm attempting to delete all columns that have over 5% of the values being NULL. My friend wrote it with a bunch of if statements, as shown below:

gone <- c()
threshold <- nrow(data)*.05
if (sum(data$columnname == "NULL") > threshold){
  gone <- append(gone,"columnname")
}

He repeats this multiple times and then at the end has the following to delete them:

print(gone)
data <- data[ , !(names(data) %in% gone)]
summary(data)

I'm trying to replicate this code using a loop, instead of 20x if statements, but can't get it to work, does anyone know how?

CodePudding user response：

df[colMeans(df == 'NULL') <0.05]
                   cyl  disp drat
Mazda RX4            6 160.0 3.90
Mazda RX4 Wag        6 160.0 3.90
Datsun 710           4 108.0 3.85
Hornet 4 Drive       6 258.0 3.08
Hornet Sportabout    8 360.0 3.15
Valiant              6 225.0 2.76
Duster 360           8 360.0 3.21
Merc 240D            4 146.7 3.69
Merc 230             4 140.8 3.92
Merc 280             6 167.6 3.92
Merc 280C            6 167.6 3.92
Merc 450SE           8 275.8 3.07
Merc 450SL           8 275.8 3.07
Merc 450SLC          8 275.8 3.07
Cadillac Fleetwood   8 472.0 2.93

CodePudding user response：

With base R, you can just have a condition when you subset. I use colSums to get the total number of rows in a column that have NULL, then divide by total number of rows and use the 0.05 threshold.

df[colSums(df == "NULL")/nrow(df) < 0.05]

Output

                   cyl  disp drat
Mazda RX4            6 160.0 3.90
Mazda RX4 Wag        6 160.0 3.90
Datsun 710           4 108.0 3.85
Hornet 4 Drive       6 258.0 3.08
Hornet Sportabout    8 360.0 3.15
Valiant              6 225.0 2.76
Duster 360           8 360.0 3.21
Merc 240D            4 146.7 3.69
Merc 230             4 140.8 3.92
Merc 280             6 167.6 3.92
Merc 280C            6 167.6 3.92
Merc 450SE           8 275.8 3.07
Merc 450SL           8 275.8 3.07
Merc 450SLC          8 275.8 3.07
Cadillac Fleetwood   8 472.0 2.93

Data

df <- structure(list(mpg = c("21", "21", "NULL", "NULL", "18.7", "18.1", 
"14.3", "24.4", "22.8", "19.2", "NULL", "16.4", "NULL", "15.2", 
"NULL"), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8), 
    disp = c(160, 160, 108, 258, 360, 225, 360, 146.7, 140.8, 
    167.6, 167.6, 275.8, 275.8, 275.8, 472), hp = c("NULL", "NULL", 
    "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "95", "123", 
    "123", "180", "180", "180", "205"), drat = c(3.9, 3.9, 3.85, 
    3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 3.07, 3.07, 
    3.07, 2.93)), row.names = c("Mazda RX4", "Mazda RX4 Wag", 
"Datsun 710", "Hornet 4 Drive", "Hornet Sportabout", "Valiant", 
"Duster 360", "Merc 240D", "Merc 230", "Merc 280", "Merc 280C", 
"Merc 450SE", "Merc 450SL", "Merc 450SLC", "Cadillac Fleetwood"
), class = "data.frame")

CodePudding user response：

An alternative in base:

Filter(\(x) mean(x == "NULL") < 0.05, df)

and its equivalent in dplyr:

dplyr::select(df, where(~ mean(.x == "NULL") < 0.05))