loop over factors and numerics to calculate their means

Time:11-02

I am trying to write a function that loops over every column of my data frame. If a column is numeric, it should return the mean; if the column is a factor, it has to do a little more work to get the overall mean. For now I am less concerned with the frequencies of the factor's categories; I have research reasons for this. So far I have cobbled some of this together, but I know it is nowhere near where it needs to be. Here is my code so far:

#basic data frame 3 variables
dat = data.frame("index" = c(1, 2, 3, 4, 5),
                     "age" = c(24, 25, 42, 56, 22), 
                     "sex" = c(0,1,1,0,0))

mean(dat$sex)
mean(dat$age)

#converting sex into a factor
dat[,3] = as.factor(dat[,3]) 

#working on the if structure to calculate the mean for all of the variables

me_func = function(x){
for (i in seq_along(x)){
if (is.factor(x)==TRUE){
  return(mean(as.numeric(as.character(x), na.rm=TRUE)))
} else {
  return(mean(x), na.rm=TRUE)
}
}
}
me_func(dat)

Because I am still learning to code in R, I know I am missing a lot. My intent is to pass the data frame itself to the function. In my research I will have much larger data frames, so listing the column names individually would be cumbersome. This also complicates things, because the id variable has to be ignored to get the right result.

Ultimately, I need the function to return the correct means: 0.40 for the factor variable (sex) and 33.8 for the numeric variable (age). I need to learn this process, as it appears to be important for the data analyses I will be doing in the foreseeable future. I thought about colMeans, but that does not get me out of a loop or some type of apply: the factors would have to be coerced to numerics first, and the coercion may produce nonsensical means, since R seems to change a 0 to a 2 when a factor is coerced, or at least it appears to in my extremely limited experience. I only want the mean of every non-id column in the data frame. Does anyone have any ideas on how to make this work? If I have missed a post that already covers this, please feel free to point me in that direction. Thank you.
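The coercion behaviour described above can be reproduced in a few lines. This is a minimal sketch using the same sex values as the example data; the variable names codes and values are only illustrative. A factor stores integer level codes alongside its labels, so calling as.numeric directly on it returns the codes rather than the original numbers.

```r
# A factor stores integer level codes (1, 2, ...) plus character labels.
sex <- factor(c(0, 1, 1, 0, 0))

# Direct coercion returns the level codes, not the original values.
codes <- as.numeric(sex)                  # 1 2 2 1 1
print(mean(codes))                        # 1.4

# Converting to character first recovers the original numbers.
values <- as.numeric(as.character(sex))   # 0 1 1 0 0
print(mean(values))                       # 0.4
```

The second form only makes sense when the factor labels are themselves numbers; labels such as "male"/"female" would become NA.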

CodePudding user response:

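You can write me_func as a function that returns the mean of a single vector (remove the for loop), and then apply it to every column with sapply. Note that na.rm=TRUE has to be passed to mean(), not to as.numeric() or return(). Dropping the first column with dat[,-1] leaves the index variable out of the result.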

me_func = function(x){
  if (is.factor(x)){
    # convert via character so the labels (0/1), not the level codes (1/2), are averaged
    return(mean(as.numeric(as.character(x)), na.rm=TRUE))
  } else {
    return(mean(x, na.rm=TRUE))
  }
}

> sapply(dat[,-1], me_func)
 age  sex 
33.8  0.4 
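For larger data frames, where the id column may not be first, one possible variation is to exclude it by name and use vapply, which checks that each result is a single number. This is only a sketch: col_mean, frame_means and the id_cols argument are hypothetical names introduced here, and it assumes every factor has numeric labels.

```r
dat <- data.frame(index = c(1, 2, 3, 4, 5),
                  age   = c(24, 25, 42, 56, 22),
                  sex   = factor(c(0, 1, 1, 0, 0)))

# Hypothetical helper: mean of one column, converting factors via character.
col_mean <- function(x) {
  if (is.factor(x)) x <- as.numeric(as.character(x))
  mean(x, na.rm = TRUE)
}

# Hypothetical wrapper: drop id columns by name, then apply col_mean to the rest.
frame_means <- function(df, id_cols = "index") {
  keep <- setdiff(names(df), id_cols)
  vapply(df[keep], col_mean, numeric(1))
}

result <- frame_means(dat)
print(result)
```

With the example data this gives the same named vector as the sapply call above (age 33.8, sex 0.4).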