Loop function over data frames and put data in new column

I have looked at some other posts that seem similar, but none of the solutions have worked for my specific situation.

I have several data frames and need to figure out how many of them consist of 50% or more NAs. I wrote this expression to compute the proportion of NAs in a single data frame:

(sum(is.na(df)))/(nrow(df)*nrow(df))

This works when I run it on individual data frames. However, when I try to loop it over all of the data frames, every iteration returns "numeric(0)" or a similar error.

Ideally, I'd also be able to store all of these values in a new data frame (na_percents). The sample code I've written so far is below. Any help would be greatly appreciated.

#Sample Data
counts1<-c(NA, 1,1,1,1,1,5,NA,NA,2, 3, 4, 3,3,2,NA)
counts2<-c(NA,NA,NA,NA,NA,NA,2,4,2,4, NA,5,2,3,NA,NA)
counts3<-c(5,5,1,3,NA,2,NA,NA,NA,NA, 4,3,2,1,1,NA)
head1<-c("Steve", "Charlie", "Kam", "Tom")
head2<-c("Chris", "Ellie", "Ben", "Louis")
head3<-c("Paul", "Tammy", "Sheila", "Sara")

df1<-matrix(counts1, nrow=4, ncol=4, byrow=TRUE)
df2<-matrix(counts2, nrow=4, ncol=4, byrow=TRUE)
df3<-matrix(counts3, nrow=4, ncol=4, byrow=TRUE)

colnames(df1)<-head1
rownames(df1)<-head1

colnames(df2)<-head2
rownames(df2)<-head2

colnames(df3)<-head3
rownames(df3)<-head3

df1<-as.data.frame(df1)
df2<-as.data.frame(df2)
df3<-as.data.frame(df3)

dataframes<-c("df1","df2","df3")

na_percents<-NULL
na_percents$dfs<-dataframes
na_percents<-as.data.frame(na_percents)

# Loop Attempt 
for (x in dataframes) {
  na_percents$percents<-(sum(is.na(x)))/(nrow(x)*nrow(x))
}

This gives me the error "Error in $<-.data.frame(*tmp*, "percents", value = numeric(0)) : replacement has 0 rows, data has 3"

I've also tried using lapply:

#lapply Attempt
lapply(dataframes, function(x) (sum(is.na(x)))/(nrow(x)*nrow(x)))

Which gives me "numeric(0)" for all dataframes.

Thank you in advance for the help.

Answer:

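Your loop iterates over the names of the data frames (character strings), not the data frames themselves, so nrow(x) is NULL and the result is numeric(0). Use get() to look up each object by its name. Sticking to your original code as much as possible, you can do this in your for loop with: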

na_percents <- data.frame(matrix(NA, ncol = length(dataframes)))

for(i in seq_along(dataframes)){
  d <- get(dataframes[i])  # look up the data frame by its name
  na_percents[,i] <- sum(is.na(d)) / (nrow(d) * ncol(d))  # NAs / total cells
}

names(na_percents) <- dataframes

#    df1    df2   df3
# 1 0.25 0.5625 0.375

If you wanted the values in rows instead of columns, a slight tweak:

na_percents <- data.frame(perc_na = matrix(NA, nrow = length(dataframes)))

for(i in seq_along(dataframes)){
  d <- get(dataframes[i])  # look up the data frame by its name
  na_percents[i,] <- sum(is.na(d)) / (nrow(d) * ncol(d))  # NAs / total cells
}

rownames(na_percents) <- dataframes

#     perc_na
# df1  0.2500
# df2  0.5625
# df3  0.3750
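
To see where numeric(0) came from in the original loop, it can help to inspect what x holds on each iteration. The sketch below uses a small made-up data frame (toy is a hypothetical name, not part of the question's data) and compares the calculation on the name with the calculation on the object that get() returns.

```r
# Hypothetical toy data: 2 rows x 2 columns, 2 of the 4 cells are NA
toy <- data.frame(a = c(NA, 1), b = c(2, NA))

x <- "toy"                 # what the original loop iterates over: a name

n_rows_of_name <- nrow(x)  # NULL, because a character vector has no dimensions
bad <- sum(is.na(x)) / (nrow(x) * nrow(x))   # numeric(0): the divisor has length zero

d <- get(x)                # look up the object that the name refers to
good <- sum(is.na(d)) / (nrow(d) * ncol(d))  # 0.5
```

Assigning a zero-length value such as bad to a data frame column is what triggers the "replacement has 0 rows" error.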

You could also use sapply on a list of the data frames and wrap the result in data.frame():

na_percents <- data.frame(perc_NA = sapply(list(df1, df2, df3), function(x)
  sum(is.na(x)) / (nrow(x) * ncol(x))))

#   perc_NA
# 1  0.2500
# 2  0.5625
# 3  0.3750
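
If you want to keep the names and get the two-column layout from the question (dfs and percents), one option is to build a named list with mget() and iterate over that. This is only a sketch of one possible approach: vapply() is swapped in for sapply() so that each result must be a single number, mean(is.na(x)) is used as an equivalent way to get the NA proportion, and df_list is a name introduced here for illustration.

```r
# Sample data from the question, built a little more compactly
counts1 <- c(NA, 1, 1, 1, 1, 1, 5, NA, NA, 2, 3, 4, 3, 3, 2, NA)
counts2 <- c(NA, NA, NA, NA, NA, NA, 2, 4, 2, 4, NA, 5, 2, 3, NA, NA)
counts3 <- c(5, 5, 1, 3, NA, 2, NA, NA, NA, NA, 4, 3, 2, 1, 1, NA)

df1 <- as.data.frame(matrix(counts1, nrow = 4, ncol = 4, byrow = TRUE))
df2 <- as.data.frame(matrix(counts2, nrow = 4, ncol = 4, byrow = TRUE))
df3 <- as.data.frame(matrix(counts3, nrow = 4, ncol = 4, byrow = TRUE))

dataframes <- c("df1", "df2", "df3")

# mget() returns a named list of the objects, so the names travel with the data
df_list <- mget(dataframes)

# One row per data frame: its name and its proportion of NA cells
na_percents <- data.frame(
  dfs = names(df_list),
  percents = unname(vapply(df_list, function(x) mean(is.na(x)), numeric(1)))
)
```

Working from a list also avoids calling get() repeatedly inside a loop.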

Another method is to use do.call with rbind and lapply, which returns a one-column matrix rather than a data frame:

na_percents <- do.call(rbind, lapply(list(df1, df2, df3), 
                      function(x) sum(is.na(x)) / (nrow(x) * ncol(x))))
rownames(na_percents) <- dataframes

#       [,1]
# df1 0.2500
# df2 0.5625
# df3 0.3750
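
The original goal was to find how many data frames are at least 50% NA. Once the proportions are stored, a logical comparison covers that. The sketch below hard-codes the three values shown in the outputs above, in the dfs/percents layout from the question, purely for illustration; mostly_na, n_mostly_na and names_mostly_na are hypothetical names.

```r
# Proportions copied from the results above (illustration only)
na_percents <- data.frame(
  dfs = c("df1", "df2", "df3"),
  percents = c(0.25, 0.5625, 0.375)
)

# TRUE for each data frame whose NA proportion is 50% or more
mostly_na <- na_percents$percents >= 0.5

n_mostly_na <- sum(mostly_na)                                 # how many qualify
names_mostly_na <- as.character(na_percents$dfs[mostly_na])   # which ones qualify
```

With the sample data, only df2 (0.5625) meets the threshold.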

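As mentioned in the comments, the nrow(x) * nrow(x) in your original code only equals the number of cells when a data frame is square, as the 4 x 4 samples here happen to be. Using prod(dim(x)), which multiplies rows by columns, is both shorter and correct for any shape: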

sum(is.na(x)) / prod(dim(x))
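
Besides being shorter, prod(dim(x)) counts cells correctly when a data frame is not square, whereas nrow(x) * nrow(x) does not. A small sketch with a made-up 2 x 3 data frame (wide is a hypothetical name) shows the two formulas diverging, and that mean(is.na(x)) gives the same answer as the prod(dim(x)) version:

```r
# Hypothetical data: 2 rows x 3 columns, 3 of the 6 cells are NA
wide <- data.frame(a = c(NA, 1), b = c(NA, 2), c = c(NA, 3))

# Divides by 2 * 2 = 4 instead of 6, so the proportion is overstated
square_only <- sum(is.na(wide)) / (nrow(wide) * nrow(wide))  # 0.75

# Divides by rows * columns = 6
by_dims <- sum(is.na(wide)) / prod(dim(wide))                # 0.5

# is.na() on a data frame returns a logical matrix, so mean() is the NA proportion
by_mean <- mean(is.na(wide))                                 # 0.5
```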