Within my dataset, I'm attempting to delete all columns that have over 5% of the values being NULL. My friend wrote it with a bunch of if statements, as shown below:
gone <- c()
threshold <- nrow(data)*.05
if (sum(data$columnname == "NULL") > threshold){
gone <- append(gone,"columnname")
}
He repeats this multiple times and then at the end has the following to delete them:
print(gone)
data <- data[ , !(names(data) %in% gone)]
summary(data)
I'm trying to replicate this code using a loop, instead of 20x if statements, but can't get it to work, does anyone know how?
CodePudding user response:
df[colMeans(df == 'NULL') <0.05]
cyl disp drat
Mazda RX4 6 160.0 3.90
Mazda RX4 Wag 6 160.0 3.90
Datsun 710 4 108.0 3.85
Hornet 4 Drive 6 258.0 3.08
Hornet Sportabout 8 360.0 3.15
Valiant 6 225.0 2.76
Duster 360 8 360.0 3.21
Merc 240D 4 146.7 3.69
Merc 230 4 140.8 3.92
Merc 280 6 167.6 3.92
Merc 280C 6 167.6 3.92
Merc 450SE 8 275.8 3.07
Merc 450SL 8 275.8 3.07
Merc 450SLC 8 275.8 3.07
Cadillac Fleetwood 8 472.0 2.93
CodePudding user response:
With base R, you can just have a condition when you subset. I use colSums
to get the total number of rows in a column that have NULL
, then divide by total number of rows and use the 0.05 threshold.
df[colSums(df == "NULL")/nrow(df) < 0.05]
Output
cyl disp drat
Mazda RX4 6 160.0 3.90
Mazda RX4 Wag 6 160.0 3.90
Datsun 710 4 108.0 3.85
Hornet 4 Drive 6 258.0 3.08
Hornet Sportabout 8 360.0 3.15
Valiant 6 225.0 2.76
Duster 360 8 360.0 3.21
Merc 240D 4 146.7 3.69
Merc 230 4 140.8 3.92
Merc 280 6 167.6 3.92
Merc 280C 6 167.6 3.92
Merc 450SE 8 275.8 3.07
Merc 450SL 8 275.8 3.07
Merc 450SLC 8 275.8 3.07
Cadillac Fleetwood 8 472.0 2.93
Data
df <- structure(list(mpg = c("21", "21", "NULL", "NULL", "18.7", "18.1",
"14.3", "24.4", "22.8", "19.2", "NULL", "16.4", "NULL", "15.2",
"NULL"), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8),
disp = c(160, 160, 108, 258, 360, 225, 360, 146.7, 140.8,
167.6, 167.6, 275.8, 275.8, 275.8, 472), hp = c("NULL", "NULL",
"NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "95", "123",
"123", "180", "180", "180", "205"), drat = c(3.9, 3.9, 3.85,
3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 3.07, 3.07,
3.07, 2.93)), row.names = c("Mazda RX4", "Mazda RX4 Wag",
"Datsun 710", "Hornet 4 Drive", "Hornet Sportabout", "Valiant",
"Duster 360", "Merc 240D", "Merc 230", "Merc 280", "Merc 280C",
"Merc 450SE", "Merc 450SL", "Merc 450SLC", "Cadillac Fleetwood"
), class = "data.frame")
CodePudding user response:
An alternative in base
:
Filter(\(x) mean(x == "NULL") < 0.05, df)
and its equivalent in dplyr
:
dplyr::select(df, where(~ mean(.x == "NULL") < 0.05))