What kind of formula is used to calculate the p-value in `t.test`?-CodePudding

So, just a touch of backstory. I've been learning biostatistics in the past 4-5 months in university, 6 months of biomathematics before that. I only started deep diving into programming around 5 days ago. I've been trying to redo t.test() with my own function.

test2 = function(t,u){
  T = (mean(t) - u) / ( sd(t) / sqrt(length(t)))
  t1=round(T, digits=5)
  df=length(t)
  cat(paste('t - value =', t1,
  '\n','df =', df-1, 
      '\n','Alternative hipotézis: a minta átlag nem egyenlő a hipotetikus átlaggal'))
}

I tried searching the formula for the p-value, I found one, but when I used it, my value was different from the one within the t.test. The t-value and the df do match t.test(). I highly appreciate any help, thank you. P.s: Don't worry about the last line, it's in Hungarian.

CodePudding user response：

The p-value can be derived from the probability function of the t distribution pt. Using this and making the notation more common with sample x and population mean mu we can use something like:

test2 <- function(x, u){
  t   <- (mean(x) - u) / (sd(x) / sqrt(length(x)))
  df  <- length(x) - 1
  cat('t-value =', t, ', df =', df, ', p =', 2 * (1 - pt(q=t, df=df)), '\n')
}

set.seed(123) # remove this for other random values

## random sample
x <- rnorm(10, mean=5.5)

## population mean
mu <- 5

## own function
test2(x, mu)

## one sample t-test from R
t.test(x, mu=mu)

We get for the own test2:

t-value = 1.905175 , df = 9, p = 0.08914715

and for R's t.test

    One Sample t-test

data:  x
t = 1.9052, df = 9, p-value = 0.08915
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
 4.892330 6.256922
sample estimates:
mean of x 
 5.574626

CodePudding user response：

The definitive source of what R is doing is the source code. If you look at the source code for stats:::t.test.default (which you can get by typing stats:::t.test.default into the console, without parentheses at the end and hitting enter), you'll see that for a single-sample test like the one you're trying to do above, you would get the following:

  nx <- length(x)
  mx <- mean(x)
  vx <- var(x)
  df <- nx - 1
  stderr <- sqrt(vx/nx)
  tstat <- (mx - mu)/stderr
  if (alternative == "less") {
    pval <- pt(tstat, df)
  }
  else if (alternative == "greater") {
    pval <- pt(tstat, df, lower.tail = FALSE)
  }
  else {
    pval <- 2 * pt(-abs(tstat), df)
  }

These are the relevant pieces (there's a lot more code in there, too).