Applying a t-test over multiple columns

I would like to apply a t-test to data frames that have different numbers of columns.

I have two data frames:

DF1:

Group	Name	Weight
A	Max	435
B	Jake	657
A	Sarah	99

DF2:

Group	Name	Weight	bmi
A	Mat	435	16
B	Amy	657	35
A	John	99	25

In reality my data frames are much longer and differ by about 100 columns, but have the same organization.

I want to look at the differences in each column by Group. For example, I want to apply a t-test comparing the weights of Group A to the weights of Group B from DF1.

When presented with DF2, I want the t.test to compare the weights between groups A and B AND also compare the bmi between groups A and B.

This is what I have so far:

lapply(DF2[-2], function(x) t.test(x ~ DF2$Group))

I get this error:

Error in if (stderr < 10 * .Machine$double.eps * max(abs(mx), abs(my))) stop("data are essentially constant") :
  missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In mean.default(x) : argument is not numeric or logical: returning NA
2: In var(x) : NAs introduced by coercion
3: In mean.default(y) : argument is not numeric or logical: returning NA
4: In var(y) :

Is there a way to avoid specifying exact column names, and instead tell t.test to run the test on every column after the second?

CodePudding user response：

To add to the answer by Onyambu, you could use

df <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  name = paste0("name", 1:6), 
  weight = c(1, 4, 2, 3, 5, 2),
  bmi = c(20, 14, 32, 55, 42, 12)
)
df
#>   group  name weight bmi
#> 1     A name1      1  20
#> 2     A name2      4  14
#> 3     A name3      2  32
#> 4     B name4      3  55
#> 5     B name5      5  42
#> 6     B name6      2  12
variables <- names(df)[-c(1, 2)] # drop the first two names (group and name)
variables
#> [1] "weight" "bmi"
ttests <- lapply(variables, function(x) t.test(reformulate("group", x), data = df))
ttests[[1]] # view the first
#> 
#>  Welch Two Sample t-test
#> 
#> data:  weight by group
#> t = -0.80178, df = 4, p-value = 0.4676
#> alternative hypothesis: true difference in means between group A and group B is not equal to 0
#> 95 percent confidence interval:
#>  -4.462835  2.462835
#> sample estimates:
#> mean in group A mean in group B 
#>        2.333333        3.333333

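The variables vector contains every column name in the data frame except the first two. lapply() iterates over those names, and reformulate("group", x) builds the formula for each one (for example, weight ~ group), so no column name has to be typed by hand.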
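With many columns, it can be easier to name the list of results and pull one number out of each test. The following is only a sketch, assuming a toy data frame like the one above and assuming that the p-value is the quantity of interest; p_values is a name introduced here for illustration.

```r
# Toy data shaped like the example above
df <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  name = paste0("name", 1:6),
  weight = c(1, 4, 2, 3, 5, 2),
  bmi = c(20, 14, 32, 55, 42, 12)
)

# Every column name except the first two
variables <- names(df)[-c(1, 2)]

# One Welch t-test per variable, each comparing group A with group B
ttests <- lapply(variables, function(x) t.test(reformulate("group", x), data = df))

# Name the list so results can be looked up by column, e.g. ttests$weight
names(ttests) <- variables

# Collect the p-value of each test into a named numeric vector
p_values <- sapply(ttests, function(tt) tt$p.value)
p_values
```

Naming the list means ttests$weight or ttests[["bmi"]] can be used instead of a positional index such as ttests[[1]].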

CodePudding user response：

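You need to drop both of the first two columns: DF2[-2] removes only Name, so the character column Group is still passed to t.test, which is what triggers the error. Also, t.test requires at least two observations in each group, so I duplicated the rows of your DF2 in the following: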

df2 = structure(list(Group = c("A", "B", "A"), Name = c("Mat", "Amy", 
"John"), Weight = c(435L, 657L, 99L), bmi = c(16L, 35L, 25L)), class =
"data.frame", row.names = c(NA, -3L)) 
df3 = rbind(df2, df2)
lapply(df3[3:4], function(x) t.test(x ~ Group, data=df3))

# $Weight
# 
#         Welch Two Sample t-test
# 
# data:  x by Group
# t = -4, df = 3, p-value = 0.03
# alternative hypothesis: true difference in means between group A and group B is 
# not equal to 0
# 95 percent confidence interval:
#  -699  -81
# sample estimates:
# mean in group A mean in group B 
#             267             657 
# 
# 
# $bmi
# 
#         Welch Two Sample t-test
# 
# data:  x by Group
# t = -6, df = 3, p-value = 0.01
# alternative hypothesis: true difference in means between group A and group B is 
# not equal to 0
# 95 percent confidence interval:
#  -22.8  -6.2
# sample estimates:
# mean in group A mean in group B 
#              20              35
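
If counting column positions is inconvenient, another option is to pick the columns by type rather than by index. This is a sketch under the assumption that every numeric column should be tested against Group; num_cols, results and summary_table are names introduced here for illustration, and the data are the same duplicated DF2 rows as above.

```r
# Rebuild the example data: DF2 with its rows duplicated
df2 <- data.frame(
  Group = c("A", "B", "A"),
  Name = c("Mat", "Amy", "John"),
  Weight = c(435L, 657L, 99L),
  bmi = c(16L, 35L, 25L)
)
df3 <- rbind(df2, df2)

# Keep only numeric columns, so Group and Name drop out wherever they sit
num_cols <- names(df3)[sapply(df3, is.numeric)]

# One Welch t-test per numeric column, comparing the two groups
results <- lapply(num_cols, function(col) t.test(df3[[col]] ~ df3$Group))
names(results) <- num_cols

# Compact overview: one row per tested column
summary_table <- data.frame(
  variable = num_cols,
  t = unname(sapply(results, function(r) r$statistic)),
  p_value = unname(sapply(results, function(r) r$p.value))
)
summary_table
```

Selecting by type keeps the call unchanged when the number of measurement columns differs between data frames, although it would also pick up any numeric column that is not meant to be tested, such as an ID.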