Loop function over data frames and put data in new column

Time:01-25

I have looked at some other posts that seem similar, but none of the solutions have worked for my specific situation.

I have several data frames and need to figure out how many of them are 50% or more NA. I wrote this expression to compute the NA proportion for a single data frame:

(sum(is.na(df)))/(nrow(df)*nrow(df))

This works when I run it on individual dataframes. However, when I try to loop this over the whole list of dataframes, they all return "numeric(0)" or a similar error.

Ideally, I'd also be able to store all of these values in a new data frame (na_percents). The sample code I have so far is below. Any help would be greatly appreciated.

#Sample Data
counts1<-c(NA, 1,1,1,1,1,5,NA,NA,2, 3, 4, 3,3,2,NA)
counts2<-c(NA,NA,NA,NA,NA,NA,2,4,2,4, NA,5,2,3,NA,NA)
counts3<-c(5,5,1,3,NA,2,NA,NA,NA,NA, 4,3,2,1,1,NA)
head1<-c("Steve", "Charlie", "Kam", "Tom")
head2<-c("Chris", "Ellie", "Ben", "Louis")
head3<-c("Paul", "Tammy", "Sheila", "Sara")

df1<-matrix(counts1, nrow=4, ncol=4, byrow=TRUE)
df2<-matrix(counts2, nrow=4, ncol=4, byrow=TRUE)
df3<-matrix(counts3, nrow=4, ncol=4, byrow=TRUE)

colnames(df1)<-head1
rownames(df1)<-head1

colnames(df2)<-head2
rownames(df2)<-head2

colnames(df3)<-head3
rownames(df3)<-head3

df1<-as.data.frame(df1)
df2<-as.data.frame(df2)
df3<-as.data.frame(df3)

dataframes<-c("df1","df2","df3")

na_percents<-NULL
na_percents$dfs<-dataframes
na_percents<-as.data.frame(na_percents)
# Loop Attempt 
for (x in dataframes) {
  na_percents$percents<-(sum(is.na(x)))/(nrow(x)*nrow(x))
}

This gives me the error "Error in $<-.data.frame(*tmp*, "percents", value = numeric(0)) : replacement has 0 rows, data has 3"

I've also tried using lapply:

#lapply Attempt
lapply(dataframes, function(x) (sum(is.na(x)))/(nrow(x)*nrow(x)))

Which gives me "numeric(0)" for all dataframes.

Thank you in advance for the help.

Answer:

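Your loop fails because dataframes holds the names of the data frames as character strings, so is.na(x) and nrow(x) are applied to a string such as "df1" rather than to the data frame itself. nrow() of a string is NULL, which is why the result is numeric(0). get() looks up the object that has that name. Sticking to your original code as much as possible, you can do this in your for loop with: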

na_percents <- data.frame(matrix(NA, ncol = length(dataframes)))

for(i in seq_along(dataframes)){
  d <- get(dataframes[i])  # look up the data frame by its name, once
  na_percents[,i] <- (sum(is.na(d)))/(nrow(d)*nrow(d))
}

names(na_percents) <- dataframes

#    df1    df2   df3
# 1 0.25 0.5625 0.375
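
To see why the string version returns numeric(0) while the get() version works, here is a minimal sketch. The small data frame is a hypothetical stand-in rather than the sample data from the question, and the names bad and good are only for illustration.

```r
# Hypothetical 2 x 2 stand-in with two NA cells
df1 <- data.frame(a = c(NA, 1), b = c(2, NA))

x <- "df1"        # what the loop variable actually holds: a name, not the data

is.na(x)          # tests the string itself, so no NA is found
nrow(x)           # a character vector has no dimensions, so this is NULL

# The denominator has length zero, so the whole result has length zero
bad <- sum(is.na(x)) / (nrow(x) * nrow(x))
length(bad)

# get() fetches the object called "df1", so the same formula now sees the data
d <- get(x)
good <- sum(is.na(d)) / (nrow(d) * nrow(d))
good              # 2 NA cells out of 4
```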

If you wanted the values in rows instead of columns, a slight tweak:

na_percents <- data.frame(perc_na = matrix(NA, nrow = length(dataframes)))

for(i in seq_along(dataframes)){
  d <- get(dataframes[i])  # look up the data frame by its name, once
  na_percents[i,] <- (sum(is.na(d)))/(nrow(d)*nrow(d))
}

rownames(na_percents) <- dataframes

#     perc_na
# df1  0.2500
# df2  0.5625
# df3  0.3750

You could also use sapply on a list of the data frames themselves, wrapping the result in data.frame():

na_percents <- data.frame(perc_NA = sapply(list(df1, df2, df3), function(x)
  (sum(is.na(x))) / (nrow(x) * nrow(x))))

#   perc_NA
# 1  0.2500
# 2  0.5625
# 3  0.3750
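
If you want to keep the data frame names next to the values and then count how many are at least 50% NA (the original goal), one possible sketch is to start from a named list. The names dfs, na_prop and props are hypothetical, and vapply() is swapped in for sapply() only because it checks that each result is a single number.

```r
# Rebuild the sample data from the question (4 x 4, filled by row)
counts1 <- c(NA, 1, 1, 1, 1, 1, 5, NA, NA, 2, 3, 4, 3, 3, 2, NA)
counts2 <- c(NA, NA, NA, NA, NA, NA, 2, 4, 2, 4, NA, 5, 2, 3, NA, NA)
counts3 <- c(5, 5, 1, 3, NA, 2, NA, NA, NA, NA, 4, 3, 2, 1, 1, NA)

# A named list keeps each name attached to its data frame
dfs <- list(
  df1 = as.data.frame(matrix(counts1, nrow = 4, byrow = TRUE)),
  df2 = as.data.frame(matrix(counts2, nrow = 4, byrow = TRUE)),
  df3 = as.data.frame(matrix(counts3, nrow = 4, byrow = TRUE))
)

# Hypothetical helper: share of NA cells in one data frame
na_prop <- function(d) sum(is.na(d)) / (nrow(d) * ncol(d))

# One number per data frame, named after the list elements
props <- vapply(dfs, na_prop, numeric(1))

na_percents <- data.frame(dfs = names(props),
                          percents = unname(props),
                          stringsAsFactors = FALSE)
na_percents

# How many data frames are at least 50% NA (only df2 in the sample data)
sum(na_percents$percents >= 0.5)
```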

Another method is to use do.call with lapply (note that this returns a one-column matrix rather than a data frame):

na_percents <- do.call(rbind, lapply(list(df1, df2, df3), 
                      function(x) (sum(is.na(x))) / (nrow(x) * nrow(x))))
rownames(na_percents) <- dataframes

#       [,1]
# df1 0.2500
# df2 0.5625
# df3 0.3750

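As mentioned in the comments, you may be able to simplify your actual calculation with prod(dim(x)), which multiplies rows by columns. Note that nrow(x) * nrow(x) matches the number of cells only when the data are square, as in the sample data: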

sum(is.na(x)) / prod(dim(x))
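
A short sketch of the difference on data that are not square. The data frame wide is a made-up example with 2 rows and 4 columns, not part of the question.

```r
# Hypothetical non-square data: 2 rows, 4 columns, 3 NA cells
wide <- data.frame(a = c(NA, 1),
                   b = c(2, NA),
                   c = c(3, 4),
                   d = c(NA, 6))

# Squaring the row count assumes 4 cells, but there are 8
by_rows_squared <- sum(is.na(wide)) / (nrow(wide) * nrow(wide))

# prod(dim(x)) is rows times columns, so it counts all 8 cells
by_dim <- sum(is.na(wide)) / prod(dim(wide))

by_rows_squared   # 3 / 4
by_dim            # 3 / 8
```

For the 4 x 4 sample data the two expressions agree, which is why the outputs above are unaffected.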