I want to extract the X-squared value and p-value (number only) from three string vectors.
smr.txt1
[1] ""
[2] "\tPearson's Chi-squared test with Yates' continuity correction"
[3] ""
[4] "data: data$parasite and data$T1"
[5] "X-squared = 0.017361, df = 1, p-value = 0.8952"
[6] ""
smr.txt2
[1] ""
[2] "\tPearson's Chi-squared test with Yates' continuity correction"
[3] ""
[4] "data: data$parasite and data$T2"
[5] "X-squared = 2.5679e-32, df = 1, p-value = 1"
[6] ""
smr.txt3
[1] ""
[2] "\tPearson's Chi-squared test with Yates' continuity correction"
[3] ""
[4] "data: data$parasite and data$T3"
[5] "X-squared = 0.17857, df = 1, p-value = 0.6726"
[6] ""
It was easy for me to extract those values from the first string vector using indexing numbers:
> c1 <- as.numeric(str_sub(smr.txt1[5], 13, 20))
> c1
[1] 0.017361
> p1 <- as.numeric(str_sub(smr.txt1[5], -6))
> p1
[1] 0.8952
But I can't do the same with the second string vector, because its X-squared value is in scientific notation and the character positions shift. I could repeat the approach for the third vector, but is there a better way, for example a loop that extracts only these values and puts them in the same data frame? Thanks in advance!
CodePudding user response:
Instead of str_sub (which is position-based and won't work when the start/end positions are not constant, as in example 2), we can use str_extract with a regex lookbehind to match the digits and . that follow the p-value substring, plus an optional exponent.
library(stringr)
f1 <- function(x, categ = "p-value") {
  # Lookbehind anchors on "<categ> = "; then match digits/dots and an optional exponent
  as.numeric(str_extract(x,
    glue::glue("(?<={categ} \\= )[0-9.]+(e-[0-9]*)?")))
}
Testing:
> f1("X-squared = 0.017361, df = 1, p-value = 0.8952")
[1] 0.8952
> f1("X-squared = 0.017361, df = 1, p-value = 0.8952", "X-squared")
[1] 0.017361
> f1("X-squared = 2.5679e-32, df = 1, p-value = 1")
[1] 1
> f1("X-squared = 2.5679e-32, df = 1, p-value = 1", "X-squared")
[1] 2.5679e-32
> f1("X-squared = 0.17857, df = 1, p-value = 0.6726")
[1] 0.6726
> f1("X-squared = 0.17857, df = 1, p-value = 0.6726", "X-squared")
[1] 0.17857
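To answer the data frame part of the question, the same lookbehind idea can be applied to all three result lines at once. The sketch below is a base-R variant (regexpr with perl = TRUE) rather than the stringr version above, and the names lines, extract_value and result are made up for illustration. It assumes each string has the "name = value" layout shown in the question.

```r
# Fifth element of each captured vector, copied from the question.
lines <- c(
  T1 = "X-squared = 0.017361, df = 1, p-value = 0.8952",
  T2 = "X-squared = 2.5679e-32, df = 1, p-value = 1",
  T3 = "X-squared = 0.17857, df = 1, p-value = 0.6726"
)

# Hypothetical helper: base-R counterpart of f1 using a PCRE lookbehind.
extract_value <- function(x, categ = "p-value") {
  pattern <- paste0("(?<=", categ, " = )[0-9.]+(e[-+]?[0-9]+)?")
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- as.integer(m) > 0          # regexpr returns -1 where nothing matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)      # regmatches returns only the matched pieces
  as.numeric(out)
}

# One row per test, one column per extracted quantity.
result <- data.frame(
  test      = names(lines),
  x_squared = extract_value(lines, "X-squared"),
  p_value   = extract_value(lines, "p-value")
)
result
```

Because the helper is vectorised over x, no explicit loop is needed; strings that do not match come back as NA instead of raising an error.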
Another option would be to convert the string to a data.frame with column names 'X-squared', 'df' and 'p-value', and then extract the column values.
f2 <- function(x, categ = "p-value") {
  # Rewrite "key = value, ..." as one "key:value" per line so read.dcf can parse it
  x1 <- gsub(",\\s*", "\n", gsub("\\s*=\\s*", ":", x))
  type.convert(as.data.frame(read.dcf(textConnection(x1))),
               as.is = TRUE)[[categ]]
}
Testing:
> f2("X-squared = 0.017361, df = 1, p-value = 0.8952")
[1] 0.8952
> f2("X-squared = 0.017361, df = 1, p-value = 0.8952", "X-squared")
[1] 0.017361
> f2("X-squared = 2.5679e-32, df = 1, p-value = 1")
[1] 1
> f2("X-squared = 2.5679e-32, df = 1, p-value = 1", "X-squared")
[1] 2.5679e-32
> f2("X-squared = 0.17857, df = 1, p-value = 0.6726")
[1] 0.6726
> f2("X-squared = 0.17857, df = 1, p-value = 0.6726", "X-squared")
[1] 0.17857
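If all three fields are wanted at once, the same key/value splitting can be done without read.dcf. This is only a sketch under the assumption that the line is a comma-separated list of "name = value" pairs with numeric values; parse_stats is a hypothetical name, not part of any package.

```r
# Hypothetical helper: split "name = value, ..." into a named numeric vector.
parse_stats <- function(x) {
  fields <- strsplit(x, ",\\s*")[[1]]        # "X-squared = ...", "df = 1", ...
  pairs  <- strsplit(fields, "\\s*=\\s*")    # list of c(name, value)
  values <- as.numeric(vapply(pairs, `[`, character(1), 2))
  names(values) <- vapply(pairs, `[`, character(1), 1)
  values
}

stats <- parse_stats("X-squared = 2.5679e-32, df = 1, p-value = 1")
stats[["X-squared"]]
stats[["p-value"]]
```

Returning a named vector keeps every field available, so the caller can pick the statistic by name instead of re-parsing the string for each one.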
It is not clear why the list returned by chisq.test needs to be converted to strings before extraction. It is easier to extract the values directly from that output with $ or [[:
M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender = c("F", "M"),
party = c("Democrat","Independent", "Republican"))
Xsq <- chisq.test(M)
Xsq$p.value
#[1] 2.953589e-07
Xsq$statistic[["X-squared"]]
#[1] 30.07015
CodePudding user response:
While it's not what you asked, it looks as though you used capture.output(.) to capture those strings. Instead of trying to extract the values from the captured output, I suggest you get the real numbers from the objects themselves.
M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender = c("F", "M"),
party = c("Democrat","Independent", "Republican"))
Xsq <- chisq.test(M)
names(Xsq)
# [1] "statistic" "parameter" "p.value" "method" "data.name" "observed" "expected" "residuals" "stdres"
Xsq[c("statistic","p.value")]
# $statistic
# X-squared
# 30.07015
# $p.value
# [1] 2.953589e-07
Since you mention having a list of these, it's easy to work with that as well. For instance, if you have a list of test results as in
Xsq2 <- lapply(list(M, M), chisq.test)
Xsq2
# [[1]]
# Pearson's Chi-squared test
# data: X[[i]]
# X-squared = 30.07, df = 2, p-value = 2.954e-07
# [[2]]
# Pearson's Chi-squared test
# data: X[[i]]
# X-squared = 30.07, df = 2, p-value = 2.954e-07
lapply(Xsq2, `[`, c("statistic", "p.value"))
# [[1]]
# [[1]]$statistic
# X-squared
# 30.07015
# [[1]]$p.value
# [1] 2.953589e-07
# [[2]]
# [[2]]$statistic
# X-squared
# 30.07015
# [[2]]$p.value
# [1] 2.953589e-07
which can be easily converted into a data.frame with:
do.call(rbind.data.frame, lapply(Xsq2, `[`, c("statistic", "p.value")))
# statistic p.value
# 1 30.07015 2.953589e-07
# 2 30.07015 2.953589e-07
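Applied to the layout in the question (one parasite column tested against T1, T2 and T3), the same idea might look like the sketch below. The data frame here is randomly generated purely for illustration, since the real data are not shown, and the names vars, tests and result are assumptions.

```r
set.seed(1)  # made-up data; replace with the real data frame
data <- data.frame(
  parasite = sample(c("yes", "no"), 200, replace = TRUE),
  T1 = sample(c("a", "b"), 200, replace = TRUE),
  T2 = sample(c("a", "b"), 200, replace = TRUE),
  T3 = sample(c("a", "b"), 200, replace = TRUE)
)

vars <- c("T1", "T2", "T3")

# Run one test per column and keep the htest objects themselves.
tests <- lapply(vars, function(v) chisq.test(data$parasite, data[[v]]))

# Pull the numbers straight from each object; vapply enforces one value each.
result <- data.frame(
  variable  = vars,
  x_squared = vapply(tests, function(res) as.numeric(res$statistic), numeric(1)),
  df        = vapply(tests, function(res) as.numeric(res$parameter), numeric(1)),
  p_value   = vapply(tests, function(res) res$p.value, numeric(1))
)
result
```

Working from the test objects keeps full numeric precision, whereas the printed summary rounds the statistic and p-value before any string parsing can see them.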