I have looked at some other posts that seem similar, but none of the solutions have worked for my specific situation.
I have several data frames and need to figure out how many of them have NAs as 50% or more of their contents. I've created this function to determine NA percentage for one dataframe:
(sum(is.na(df)))/(nrow(df)*nrow(df))
This works when I run it on individual dataframes. However, when I try to loop this over the whole list of dataframes, they all return "numeric(0)" or a similar error.
Ideally, I'd also be able to store all of these values in a new dataframe (na_percents). The sample code I have so far is below. Any help on this would be greatly appreciated.
#Sample Data
counts1<-c(NA, 1,1,1,1,1,5,NA,NA,2, 3, 4, 3,3,2,NA)
counts2<-c(NA,NA,NA,NA,NA,NA,2,4,2,4, NA,5,2,3,NA,NA)
counts3<-c(5,5,1,3,NA,2,NA,NA,NA,NA, 4,3,2,1,1,NA)
head1<-c("Steve", "Charlie", "Kam", "Tom")
head2<-c("Chris", "Ellie", "Ben", "Louis")
head3<-c("Paul", "Tammy", "Sheila", "Sara")
df1<-matrix(counts1, nrow=4, ncol=4, byrow=TRUE)
df2<-matrix(counts2, nrow=4, ncol=4, byrow=TRUE)
df3<-matrix(counts3, nrow=4, ncol=4, byrow=TRUE)
colnames(df1)<-head1
rownames(df1)<-head1
colnames(df2)<-head2
rownames(df2)<-head2
colnames(df3)<-head3
rownames(df3)<-head3
df1<-as.data.frame(df1)
df2<-as.data.frame(df2)
df3<-as.data.frame(df3)
dataframes<-c("df1","df2","df3")
na_percents<-NULL
na_percents$dfs<-dataframes
na_percents<-as.data.frame(na_percents)
# Loop Attempt
for (x in dataframes) {
na_percents$percents<-(sum(is.na(x)))/(nrow(x)*nrow(x))
}
This gives me the error "Error in $<-.data.frame(*tmp*, "percents", value = numeric(0)) : replacement has 0 rows, data has 3"
I've also tried using lapply:
#lapply Attempt
lapply(dataframes, function(x) (sum(is.na(x)))/(nrow(x)*nrow(x)))
Which gives me "numeric(0)" for all dataframes.
Thank you in advance for the help.
Answer:
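Sticking to your original code as much as possible, you can do this in your for loop by using get() to retrieve each data frame from its name, since dataframes holds character strings rather than the data frames themselves: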
na_percents <- data.frame(matrix(NA, ncol = length(dataframes)))
for(i in seq_along(dataframes)){
na_percents[,i] <- (sum(is.na(get(dataframes[i]))))/(nrow(get(dataframes[i]))*nrow(get(dataframes[i])))
}
names(na_percents) <- dataframes
# df1 df2 df3
# 1 0.25 0.5625 0.375
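To see what the original loop was actually computing, here is a minimal sketch. The object toy is a hypothetical stand-in for one of your data frames, and the denominator uses nrow * ncol rather than the question's nrow * nrow.

```r
# Hypothetical example data: 3 of the 4 cells are NA
toy <- data.frame(a = c(1, NA), b = c(NA, NA))
x <- "toy"                       # a name (character string), not the object

is.na(x)                         # FALSE: the string itself is not missing
nrow(x)                          # NULL: a character vector has no dimensions
by_name <- sum(is.na(x)) / (nrow(x) * nrow(x))
length(by_name)                  # 0, i.e. numeric(0)

obj <- get(x)                    # look the object up by its name
by_object <- sum(is.na(obj)) / (nrow(obj) * ncol(obj))
by_object                        # 0.75
```

Because NULL * NULL evaluates to numeric(0), the division also returns numeric(0), which is the value the loop then tried to assign to a three-row data frame.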
If you wanted the values in rows instead of columns, a slight tweak:
na_percents <- data.frame(perc_na = matrix(NA, nrow = length(dataframes)))
for(i in seq_along(dataframes)){
na_percents[i,] <- (sum(is.na(get(dataframes[i]))))/(nrow(get(dataframes[i]))*nrow(get(dataframes[i])))
}
rownames(na_percents) <- dataframes
# perc_na
# df1 0.2500
# df2 0.5625
# df3 0.3750
You could also use sapply in the following way, wrapping it in data.frame():
na_percents <- data.frame(perc_NA = sapply(list(df1, df2, df3), function(x)
(sum(is.na(x))) / (nrow(x) * nrow(x))))
# perc_NA
# 1 0.2500
# 2 0.5625
# 3 0.3750
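Note that list(df1, df2, df3) is unnamed, so the rows come back numbered 1 to 3. One possible variation, sketched here with small hypothetical data frames in place of yours, is to build a named list with mget() so the names travel with the results:

```r
# Hypothetical stand-ins for df1, df2, df3
df1 <- data.frame(a = c(1, NA), b = c(2, 3))     # 1 of 4 cells NA
df2 <- data.frame(a = c(NA, NA), b = c(NA, 3))   # 3 of 4 cells NA
df3 <- data.frame(a = c(1, 2), b = c(NA, NA))    # 2 of 4 cells NA

dataframes <- c("df1", "df2", "df3")
df_list <- mget(dataframes)      # named list: names are "df1", "df2", "df3"

na_percents <- data.frame(
  dfs = names(df_list),
  perc_NA = sapply(df_list, function(x) sum(is.na(x)) / (nrow(x) * ncol(x)))
)
na_percents
```

This keeps the dfs column from your original na_percents and avoids calling get() repeatedly.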
Another method is to use do.call with lapply:
na_percents <- do.call(rbind, lapply(list(df1, df2, df3),
function(x) (sum(is.na(x))) / (nrow(x) * nrow(x))))
rownames(na_percents) <- dataframes
# [,1]
# df1 0.2500
# df2 0.5625
# df3 0.3750
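As mentioned in the comments, you can simplify the calculation with the line below. It also stays correct for non-square data frames, whereas nrow(x) * nrow(x) only equals the number of cells when the row and column counts match, as in your 4x4 sample: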
sum(is.na(x)) / prod(dim(x))
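To get back to the original goal of counting how many data frames are at least 50% NA, something along these lines may work. It is only a sketch: na_fraction is a hypothetical helper name, and the data frames are small made-up examples rather than your data.

```r
# Hypothetical helper: share of cells that are NA
na_fraction <- function(x) sum(is.na(x)) / prod(dim(x))

# Hypothetical example data in a named list
df_list <- list(
  df1 = data.frame(a = c(1, NA), b = c(2, 3)),     # 1 of 4 cells NA
  df2 = data.frame(a = c(NA, NA), b = c(NA, 3)),   # 3 of 4 cells NA
  df3 = data.frame(a = c(1, 2), b = c(NA, NA))     # 2 of 4 cells NA
)

fractions <- vapply(df_list, na_fraction, numeric(1))
mostly_na <- fractions >= 0.5    # TRUE where half or more of the cells are NA

sum(mostly_na)                   # how many data frames meet the threshold
names(fractions)[mostly_na]      # which ones they are
```

vapply() is used instead of sapply() here so that the result is guaranteed to be a numeric vector with one value per data frame.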