I need to write a function that takes a data frame and the name or position of one of its factor variables, and calculates the mean of each numeric variable for each level of that factor. I want to write the function myself rather than rely on packages, because I'm learning to program functions.
I have this function, but it is not working: the results come back as missing values.
promedioXvariable <- function(df, cat) {
  res <- list()
  for (x in levels(df[[cat]])) {
    aux <- list()
    for (var in colnames(df)) {
      if(class(df[[var]]) == "numeric") {
        aux[[var]] <- with(df, tapply(var, x, mean))
      }
    }
    res[[x]] <- aux
  }
  return(res)
}
I want a list with one element per factor level, holding the mean of each numeric variable, but my function returns NAs instead:
$setosa
$setosa$Sepal.Length
setosa
    NA
CodePudding user response:
Here are several solutions that use only base R, with no packages:
aggregate
fun1 <- function(dat, cat){
  aggregate(reformulate(cat, "."), dat, mean)
}
fun1(iris, "Species")
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
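As a side note on the aggregate solution above, reformulate() is what turns the column name passed as a string into a formula. A minimal sketch of what it produces, assuming the built-in iris data and "Species" as the grouping column:

```r
# reformulate(termlabels, response) builds a formula from character strings
f <- reformulate("Species", response = ".")
print(f)

# The same aggregation with the formula written out by hand
by_formula <- aggregate(f, data = iris, FUN = mean)
by_hand <- aggregate(. ~ Species, data = iris, FUN = mean)
```

The dot on the left-hand side stands for every column that is not named on the right-hand side, which is why this version needs no explicit list of numeric variables.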
by
fun2 <- function(dat, cat){
  by(dat[setdiff(names(dat), cat)], dat[cat], colMeans)
}
fun2(iris, "Species")
Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.006 3.428 1.462 0.246
-------------------------------------------------
Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.936 2.770 4.260 1.326
-------------------------------------------------
Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width
6.588 2.974 5.552 2.026
split sapply
fun3 <- function(dat, cat){
  sapply(split(dat[setdiff(names(dat), cat)], dat[cat]), colMeans)
}
fun3(iris, "Species")
setosa versicolor virginica
Sepal.Length 5.006 5.936 6.588
Sepal.Width 3.428 2.770 2.974
Petal.Length 1.462 4.260 5.552
Petal.Width 0.246 1.326 2.026
tapply
fun4 <- function(dat, cat){
  dat1 <- dat[setdiff(names(dat), cat)]
  a <- array(do.call(paste, dat[cat]), dim(dat1))
  b <- array(names(dat1)[col(dat1)], dim(dat1))
  tapply(unlist(dat1), list(a, b), mean)
}
fun4(iris, "Species")
Petal.Length Petal.Width Sepal.Length Sepal.Width
setosa 1.462 0.246 5.006 3.428
versicolor 4.260 1.326 5.936 2.770
virginica 5.552 2.026 6.588 2.974
sapply tapply
fun5 <- function(dat, cat){
  sapply(dat[setdiff(names(dat), cat)], tapply, dat[cat], mean)
}
fun5(iris, "Species")
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
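The question also asks for the factor to be selectable by name or by position, and the functions above assume a name and assume every other column is numeric. A minimal sketch of one way to cover both points, where mean_by_level is a hypothetical helper name and not part of any of the solutions above:

```r
# Hypothetical helper: accepts the factor column by name or by position
# and averages only the numeric columns.
mean_by_level <- function(dat, cat) {
  # Convert a numeric position into the corresponding column name
  if (is.numeric(cat)) cat <- names(dat)[cat]
  stopifnot(is.factor(dat[[cat]]))
  # Keep only numeric columns, excluding the grouping column itself
  num_cols <- setdiff(names(dat)[sapply(dat, is.numeric)], cat)
  # One row per factor level, one column per numeric variable
  sapply(dat[num_cols], function(col) tapply(col, dat[[cat]], mean))
}

by_name <- mean_by_level(iris, "Species")
by_position <- mean_by_level(iris, 5)  # Species is the 5th column of iris
```

Both calls should return the same matrix, with one row per level and one column per numeric variable.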
CodePudding user response:
Your main problem is here:
aux[[var]] <- with(df, tapply(var, x, mean))
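tapply() expects a factor or list of factors as the INDEX arg, but you’re just passing one factor level as a character (x). Note also that var holds the column name as a string, not the column itself, so mean() is applied to a character value and returns NA. Instead, you can subset your data to rows where the cat variable is equal to the factor level x: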
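promedioXvariable <- function(df, cat) {
  res <- list()
  for (x in levels(df[[cat]])) {
    aux <- list()
    for (var in colnames(df)) {
      # is.numeric() also accepts integer columns and is safe when class() has length > 1
      if (is.numeric(df[[var]])) {
        # mean of this column over the rows belonging to level x
        aux[[var]] <- mean(df[df[[cat]] == x, var])
      }
    }
    # collapse the per-variable list into one named numeric vector
    res[[x]] <- unlist(aux)
  }
  res
}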
promedioXvariable(iris, "Species")
$setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.006 3.428 1.462 0.246
$versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.936 2.770 4.260 1.326
$virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width
6.588 2.974 5.552 2.026
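To see why the original call returned NA, it can help to run the two tapply() calls side by side outside the function. A minimal sketch, assuming the iris data; the names bad and good are only for illustration:

```r
var <- "Sepal.Length"
x <- "setosa"

# What the original loop effectively computed: the mean of a single
# character string, grouped by a single label
bad <- suppressWarnings(tapply(var, x, mean))

# What tapply() is designed for: a vector of values plus a grouping
# factor of the same length
good <- tapply(iris[[var]], iris$Species, mean)
good
```

Here bad is an NA labelled "setosa", matching the output in the question, while good holds one mean per level of Species.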