Applying a t-test over multiple columns

Time:11-18

I would like to apply a t-test to data frames that have differing numbers of columns.

I have two data frames:

DF1:

Group Name Weight
A Max 435
B Jake 657
A Sarah 99

DF2:

Group Name Weight bmi
A Mat 435 16
B Amy 657 35
A John 99 25

In reality my data frames are much longer and differ by about 100 columns, but have the same organization.

I want to look at the differences in each column by Group. For example, I want to apply a t-test comparing the weights of Group A to the weights of Group B from DF1.

When presented with DF2, I want the t.test to compare the weights between groups A and B AND also compare the bmi between groups A and B.

This is what I have so far:

lapply(DF2[-2], function(x) t.test(x ~ DF2$Group))

I get this error:

Error in if (stderr < 10 * .Machine$double.eps * max(abs(mx), abs(my))) stop("data are essentially constant") :
  missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In mean.default(x) : argument is not numeric or logical: returning NA
2: In var(x) : NAs introduced by coercion
3: In mean.default(y) : argument is not numeric or logical: returning NA
4: In var(y) :

Is there a way to avoid specifying exact column names and instead tell t.test to run the test on every column after the second?

CodePudding user response:

To add to the answer by Onyambu, you could use

df <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  name = paste0("name", 1:6), 
  weight = c(1, 4, 2, 3, 5, 2),
  bmmi = c(20, 14, 32, 55, 42, 12)
)
df
#>   group  name weight bmmi
#> 1     A name1      1   20
#> 2     A name2      4   14
#> 3     A name3      2   32
#> 4     B name4      3   55
#> 5     B name5      5   42
#> 6     B name6      2   12
variables <- names(df)[-c(1, 2)] # skip the first two columns (group and name)
variables
#> [1] "weight" "bmmi"
ttests <- lapply(variables, function(x) t.test(reformulate("group", x), data = df))
ttests[[1]] # view the first
#> 
#>  Welch Two Sample t-test
#> 
#> data:  weight by group
#> t = -0.80178, df = 4, p-value = 0.4676
#> alternative hypothesis: true difference in means between group A and group B is not equal to 0
#> 95 percent confidence interval:
#>  -4.462835  2.462835
#> sample estimates:
#> mean in group A mean in group B 
#>        2.333333        3.333333

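The variables vector contains every column name in the data frame except the first two, and lapply() iterates over each element of variables. For each name, reformulate("group", x) builds the formula x ~ group (for example weight ~ group), so no column name has to be typed by hand.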
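With around 100 columns, printing each test object one by one gets unwieldy. As a rough sketch (not part of the original answer), the results could be collected into one small table. The name summary_table is made up for illustration, and the toy data is the same as above:

```r
# Same toy data as in the answer above (illustrative values only)
df <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  name = paste0("name", 1:6),
  weight = c(1, 4, 2, 3, 5, 2),
  bmmi = c(20, 14, 32, 55, 42, 12)
)

# Every column except the first two (group and name)
variables <- names(df)[-c(1, 2)]

# One Welch t-test per variable, built from the formula <variable> ~ group
ttests <- lapply(variables, function(x) t.test(reformulate("group", x), data = df))

# Pull the t statistic and p-value out of each "htest" object.
# summary_table is a hypothetical name, not from the original post.
summary_table <- data.frame(
  variable = variables,
  statistic = vapply(ttests, function(tt) unname(tt$statistic), numeric(1)),
  p_value = vapply(ttests, function(tt) tt$p.value, numeric(1))
)
summary_table
```

Each row of summary_table then corresponds to one tested column, which is easier to scan or filter than a long list of printed tests.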

CodePudding user response:

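You need to drop both of the first two columns. DF2[-2] removes only Name, so the Group column itself is still passed to t.test as a response, which is most likely what triggers the "argument is not numeric or logical" warnings. Also, t.test requires at least two observations in each group, and group B in your DF2 has only one. To get a runnable example, I duplicated the rows of your DF2 in the following: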

df2 = structure(list(Group = c("A", "B", "A"), Name = c("Mat", "Amy", 
"John"), Weight = c(435L, 657L, 99L), bmi = c(16L, 35L, 25L)), class =
"data.frame", row.names = c(NA, -3L)) 
df3 = rbind(df2, df2)
lapply(df3[3:4], function(x) t.test(x ~ Group, data=df3))

# $Weight
# 
#         Welch Two Sample t-test
# 
# data:  x by Group
# t = -4, df = 3, p-value = 0.03
# alternative hypothesis: true difference in means between group A and group B is 
# not equal to 0
# 95 percent confidence interval:
#  -699  -81
# sample estimates:
# mean in group A mean in group B 
#             267             657 
# 
# 
# $bmi
# 
#         Welch Two Sample t-test
# 
# data:  x by Group
# t = -6, df = 3, p-value = 0.01
# alternative hypothesis: true difference in means between group A and group B is 
# not equal to 0
# 95 percent confidence interval:
#  -22.8  -6.2
# sample estimates:
# mean in group A mean in group B 
#              20              35 
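
If the position of the non-numeric columns is not guaranteed, one option is to select the columns by type instead of by index. This is only a sketch under the assumption that every numeric column should be tested. numeric_cols, results and p_values are illustrative names, and df3 is rebuilt the same way as above:

```r
# Rebuild the duplicated DF2 from above so the example is self-contained
df2 <- data.frame(
  Group = c("A", "B", "A"),
  Name = c("Mat", "Amy", "John"),
  Weight = c(435L, 657L, 99L),
  bmi = c(16L, 35L, 25L)
)
df3 <- rbind(df2, df2)

# Keep only the columns that are numeric (this drops Group and Name)
numeric_cols <- names(df3)[vapply(df3, is.numeric, logical(1))]

# One Welch t-test per numeric column, split by Group
results <- lapply(df3[numeric_cols], function(x) t.test(x ~ Group, data = df3))

# Named vector of p-values, one per tested column
p_values <- vapply(results, function(tt) tt$p.value, numeric(1))
p_values
```

Because lapply() is given a named list of columns, results and p_values keep the column names, so each test can be looked up as results$Weight or results$bmi.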