I need to write a function that receives a data set and the name or position of one of its factor variables, and calculates the mean of each numeric variable for each level of that factor. I need to write the function myself rather than rely on packages, because I'm learning to program functions.

I have this function, but it is not working: the results are missing values (NA).

promedioXvariable <- function(df, cat) {
  res <- list()
  for (x in levels(df[[cat]])) {
    aux <- list()
    for (var in colnames(df)) {
      if(class(df[[var]]) == "numeric") {
        aux[[var]] <- with(df, tapply(var, x, mean))
      }
    }
    res[[x]] <- aux
  }
  return(res)
}

The result I want is a list structured like this, but with my function I get NAs:

$setosa
$setosa$Sepal.Length
setosa 
    NA 

CodePudding user response：

Here are several solutions using only base R (no packages):

aggregate

fun1 <- function(dat, cat){
   aggregate(reformulate(cat, "."), dat, mean)
}

fun1(iris, "Species")
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

by

fun2 <- function(dat, cat){
  by(dat[setdiff(names(dat), cat)], dat[cat], colMeans)
}

fun2(iris, "Species")
Species: setosa
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       5.006        3.428        1.462        0.246 
------------------------------------------------- 
Species: versicolor
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       5.936        2.770        4.260        1.326 
------------------------------------------------- 
Species: virginica
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       6.588        2.974        5.552        2.026

split sapply

fun3 <- function(dat, cat){
  sapply(split(dat[setdiff(names(dat), cat)], dat[cat]),colMeans)
}

fun3(iris, "Species")
             setosa versicolor virginica
Sepal.Length  5.006      5.936     6.588
Sepal.Width   3.428      2.770     2.974
Petal.Length  1.462      4.260     5.552
Petal.Width   0.246      1.326     2.026

tapply

fun4 <- function(dat, cat){
  dat1 <- dat[setdiff(names(dat), cat)]
  a <- array(do.call(paste, dat[cat]), dim(dat1))
  b <- array(names(dat1)[col(dat1)], dim(dat1))

  tapply(unlist(dat1), list(a, b), mean)
}

fun4(iris, "Species")
           Petal.Length Petal.Width Sepal.Length Sepal.Width
setosa            1.462       0.246        5.006       3.428
versicolor        4.260       1.326        5.936       2.770
virginica         5.552       2.026        6.588       2.974

sapply tapply

fun5 <- function(dat, cat){
  sapply(dat[setdiff(names(dat), cat)], tapply, dat[cat], mean)
}

fun5(iris, "Species")
          Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026
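
The question also asks that the factor can be given by name or by position, while the functions above assume a name. A minimal sketch of one way to handle both, assuming the hypothetical name mean_by_factor and assuming that every numeric column should be averaged:

```r
# Hypothetical helper (not from the answer above): accepts the grouping
# column either by name or by position.
mean_by_factor <- function(dat, cat) {
  # Turn a numeric position into the corresponding column name.
  if (is.numeric(cat)) cat <- names(dat)[cat]
  stopifnot(is.factor(dat[[cat]]))
  # Keep only the numeric columns; the factor itself is dropped here.
  num <- dat[vapply(dat, is.numeric, logical(1))]
  # One row per factor level, one column per numeric variable.
  sapply(num, function(col) tapply(col, dat[[cat]], mean))
}

mean_by_factor(iris, "Species")
mean_by_factor(iris, 5)  # same result, column given by position
```

Selecting columns with is.numeric instead of setdiff(names(dat), cat) also keeps the function from failing when the data contains other non-numeric columns.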

CodePudding user response：

Your main problem is here:

aux[[var]] <- with(df, tapply(var, x, mean))

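tapply() expects a factor (or list of factors) with the same length as the data as its INDEX argument, but you're passing a single factor level as a character string (x). Likewise, var is only the column name as a string, not the column itself, so mean() is applied to a character value and returns NA. Instead, you can subset your data to the rows where the cat variable equals the factor level x: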

promedioXvariable <- function(df, cat) {
  res <- list()
  for (x in levels(df[[cat]])) {
    aux <- list()
    for (var in colnames(df)) {
      if (is.numeric(df[[var]])) {  # also accepts integer columns
        aux[[var]] <- mean(df[df[[cat]] == x, var])
      }
    }
    res[[x]] <- unlist(aux)
  }
  res
}

promedioXvariable(iris, "Species")

$setosa
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       5.006        3.428        1.462        0.246 

$versicolor
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       5.936        2.770        4.260        1.326 

$virginica
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       6.588        2.974        5.552        2.026
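
To see the difference in isolation, here is a small sketch using iris; the names bad and good are only for illustration. It reproduces what the original tapply() call evaluated inside the loops, then shows the call with the column itself and the whole factor:

```r
# Inside the loops, var and x are plain strings, so this is what was evaluated.
# mean() of a character value warns and returns NA.
bad <- suppressWarnings(tapply("Sepal.Length", "setosa", mean))
bad

# Passing the column itself and the whole factor gives one mean per level.
good <- tapply(iris[["Sepal.Length"]], iris[["Species"]], mean)
good
```

With the whole factor as INDEX, the outer loop over levels is no longer needed for a single column.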