I would like to apply a t-test to data frames with differing numbers of columns.
I have two data frames:
DF1:
Group | Name | Weight |
---|---|---|
A | Max | 435 |
B | Jake | 657 |
A | Sarah | 99 |
DF2:
Group | Name | Weight | bmi |
---|---|---|---|
A | Mat | 435 | 16 |
B | Amy | 657 | 35 |
A | John | 99 | 25 |
In reality my data frames are much longer and differ by about 100 columns, but have the same organization.
I want to look at the differences in each column by Group. For example, I want to apply a t-test comparing the weights of Group A to the weights of Group B from DF1.
When presented with DF2, I want the t.test to compare the weights between groups A and B AND also compare the bmi between groups A and B.
This is what I have so far:
lapply(DF2[-2], function(x) t.test(x ~ DF2$Group))
I get this error:
Error in if (stderr < 10 * .Machine$double.eps * max(abs(mx), abs(my))) stop("data are essentially constant") :
  missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In mean.default(x) : argument is not numeric or logical: returning NA
2: In var(x) : NAs introduced by coercion
3: In mean.default(y) : argument is not numeric or logical: returning NA
4: In var(y) :
Error in if (stderr < 10 * .Machine$double.eps * max(abs(mx), abs(my))) stop("data are essentially constant") :
  missing value where TRUE/FALSE needed
Is there a way to avoid specifying exact column names, and instead tell the t.test function to run the test on every column after the second?
CodePudding user response:
To add to the answer by Onyambu, you could use
df <- data.frame(
group = c("A", "A", "A", "B", "B", "B"),
name = paste0("name", 1:6),
weight = c(1, 4, 2, 3, 5, 2),
bmmi = c(20, 14, 32, 55, 42, 12)
)
df
#> group name weight bmmi
#> 1 A name1 1 20
#> 2 A name2 4 14
#> 3 A name3 2 32
#> 4 B name4 3 55
#> 5 B name5 5 42
#> 6 B name6 2 12
variables <- names(df)[-c(1, 2)] # skip the first two names (group and name)
variables
#> [1] "weight" "bmmi"
ttests <- lapply(variables, function(x) t.test(reformulate("group", x), data = df))
ttests[[1]] # view the first
#>
#> Welch Two Sample t-test
#>
#> data: weight by group
#> t = -0.80178, df = 4, p-value = 0.4676
#> alternative hypothesis: true difference in means between group A and group B is not equal to 0
#> 95 percent confidence interval:
#> -4.462835 2.462835
#> sample estimates:
#> mean in group A mean in group B
#> 2.333333 3.333333
The variables vector contains all variable names in the data frame except the first two, and lapply() iterates over each element of variables.
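As a possible follow-up, naming the list after the variables makes individual results easier to look up, and the p-values can then be pulled out in one step. This is only a sketch that rebuilds the example data from above; the name pvals is my own choice for illustration.

```r
df <- data.frame(
  group  = c("A", "A", "A", "B", "B", "B"),
  name   = paste0("name", 1:6),
  weight = c(1, 4, 2, 3, 5, 2),
  bmmi   = c(20, 14, 32, 55, 42, 12)
)
variables <- names(df)[-c(1, 2)]

# Same loop as above, then label each result with its variable name
ttests <- lapply(variables, function(x) t.test(reformulate("group", x), data = df))
names(ttests) <- variables

# One p-value per variable, returned as a named numeric vector
# (pvals is a hypothetical name, not part of the original answer)
pvals <- sapply(ttests, function(tt) tt$p.value)
pvals
```

With the names in place, ttests$weight or ttests[["bmmi"]] returns the full test object for that column.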
CodePudding user response:
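You need to drop both of the first two columns, not just the second. DF2[-2] still contains Group, so t.test is also run on a character column, which is where the "argument is not numeric or logical" warnings come from. Also, t.test requires at least two observations in each group, and group B has only one row in your example. I cloned your DF2 data frame and duplicated its rows in the following: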
df2 = structure(list(Group = c("A", "B", "A"), Name = c("Mat", "Amy",
"John"), Weight = c(435L, 657L, 99L), bmi = c(16L, 35L, 25L)), class =
"data.frame", row.names = c(NA, -3L))
df3 = rbind(df2, df2)
lapply(df3[3:4], function(x) t.test(x ~ Group, data=df3))
# $Weight
#
# Welch Two Sample t-test
#
# data: x by Group
# t = -4, df = 3, p-value = 0.03
# alternative hypothesis: true difference in means between group A and group B is
# not equal to 0
# 95 percent confidence interval:
# -699 -81
# sample estimates:
# mean in group A mean in group B
# 267 657
#
#
# $bmi
#
# Welch Two Sample t-test
#
# data: x by Group
# t = -6, df = 3, p-value = 0.01
# alternative hypothesis: true difference in means between group A and group B is
# not equal to 0
# 95 percent confidence interval:
# -22.8 -6.2
# sample estimates:
# mean in group A mean in group B
# 20 35
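If the column positions are not fixed across your data frames, one option is to pick the columns by type rather than by index and to collect the key numbers in a single table. The following is a minimal sketch under the assumption that every numeric column should be tested; num_cols and results are names I made up for illustration.

```r
# DF2 from the question, duplicated so that each group has at least two rows
df2 <- data.frame(
  Group  = c("A", "B", "A"),
  Name   = c("Mat", "Amy", "John"),
  Weight = c(435L, 657L, 99L),
  bmi    = c(16L, 35L, 25L)
)
df3 <- rbind(df2, df2)

# Select the columns to test by type instead of by position
num_cols <- names(df3)[sapply(df3, is.numeric)]

# Run one Welch t-test per numeric column and keep the statistic and p-value
results <- do.call(rbind, lapply(num_cols, function(col) {
  tt <- t.test(df3[[col]] ~ df3$Group)
  data.frame(variable  = col,
             statistic = unname(tt$statistic),
             p.value   = tt$p.value)
}))
results
```

This returns one row per tested column, which may be easier to scan than the printed list when there are around 100 columns.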