How to extract values from string lists in R?

Time:02-14

I want to extract the X-squared value and p-value (number only) from three string vectors.

smr.txt1

[1] ""                                                              
[2] "\tPearson's Chi-squared test with Yates' continuity correction"
[3] ""                                                              
[4] "data:  data$parasite and data$T1"                              
[5] "X-squared = 0.017361, df = 1, p-value = 0.8952"                
[6] ""    

smr.txt2    

[1] ""                                                              
[2] "\tPearson's Chi-squared test with Yates' continuity correction"
[3] ""                                                              
[4] "data:  data$parasite and data$T2"                              
[5] "X-squared = 2.5679e-32, df = 1, p-value = 1"                   
[6] ""  

smr.txt3

[1] ""                                                              
[2] "\tPearson's Chi-squared test with Yates' continuity correction"
[3] ""                                                              
[4] "data:  data$parasite and data$T3"                              
[5] "X-squared = 0.17857, df = 1, p-value = 0.6726"                
[6] ""  

It was easy for me to extract those values from the first string vector using character positions:

> c1 <- as.numeric(str_sub(smr.txt1[5], 13, 20))

> c1

[1] 0.017361

> p1 <- as.numeric(str_sub(smr.txt1[5], -6))

> p1

[1] 0.8952

But I can't do the same with the second string vector, because the statistic is printed in scientific notation and so the positions shift. I could hard-code positions for the third string vector as well, but is there a better way, for example a loop that extracts only these values and puts them all in one data frame? Thanks in advance!

CodePudding user response:

str_sub is position based, so it won't work when the start/end positions are not constant, as in example 2. Instead, we may use a regex lookbehind with str_extract: match the label (p-value by default) followed by " = ", then extract the digits and decimal point that follow, plus an optional exponent such as e-32.

library(stringr)
f1 <- function(x, categ ="p-value") {
     # lookbehind for "<categ> = ", then digits/decimal point and an optional exponent like e-32
     as.numeric(str_extract(x, 
        glue::glue("(?<={categ} \\= )[0-9.]+(e-[0-9]*)?")))
     }

-testing

> f1("X-squared = 0.017361, df = 1, p-value = 0.8952")
[1] 0.8952
> f1("X-squared = 0.017361, df = 1, p-value = 0.8952", "X-squared")
[1] 0.017361
> f1("X-squared = 2.5679e-32, df = 1, p-value = 1")
[1] 1
> f1("X-squared = 2.5679e-32, df = 1, p-value = 1", "X-squared")
[1] 2.5679e-32
> f1("X-squared = 0.17857, df = 1, p-value = 0.6726")
[1] 0.6726
> f1("X-squared = 0.17857, df = 1, p-value = 0.6726", "X-squared")
[1] 0.17857
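
If you would rather avoid the stringr and glue dependencies, the same idea can be sketched in base R and applied to all three vectors at once. This is only a sketch: extract_stat and the column names are made-up names, and the three vectors are retyped from the question. It also accepts "<" after the label, on the assumption that a very small p-value may be printed as "p-value < 2.2e-16".

```r
# Hypothetical helper (not from the answer above): return the number that
# follows "<label> = " or "<label> < " in each string, or NA if there is none.
extract_stat <- function(x, label) {
  # digits and decimal point, optionally followed by an exponent (e-32, e+05)
  pattern <- paste0(".*", label, " [=<] ([0-9.]+(?:[eE][-+]?[0-9]+)?).*")
  hit <- grepl(pattern, x, perl = TRUE)
  out <- rep(NA_real_, length(x))
  out[hit] <- as.numeric(sub(pattern, "\\1", x[hit], perl = TRUE))
  out
}

# The three captured outputs, retyped from the question
header <- "\tPearson's Chi-squared test with Yates' continuity correction"
smr.txt1 <- c("", header, "", "data:  data$parasite and data$T1",
              "X-squared = 0.017361, df = 1, p-value = 0.8952", "")
smr.txt2 <- c("", header, "", "data:  data$parasite and data$T2",
              "X-squared = 2.5679e-32, df = 1, p-value = 1", "")
smr.txt3 <- c("", header, "", "data:  data$parasite and data$T3",
              "X-squared = 0.17857, df = 1, p-value = 0.6726", "")

smr <- list(T1 = smr.txt1, T2 = smr.txt2, T3 = smr.txt3)

# Find the statistics line by content instead of assuming it is element 5
stat_lines <- vapply(smr, function(v) grep("^X-squared", v, value = TRUE)[1],
                     character(1))

res <- data.frame(
  test      = names(smr),
  x_squared = extract_stat(stat_lines, "X-squared"),
  p_value   = extract_stat(stat_lines, "p-value")
)
res
```

Rows come out in the order of the list, so res$p_value[2] is the p-value for T2.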

Another option would be to parse the line into a one-row data.frame with column names 'X-squared', 'df' and 'p-value', and then extract the column of interest

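f2 <- function(x, categ = "p-value") {
   # turn "a = 1, b = 2" into DCF-style "a:1\nb:2" so that read.dcf() can parse it
   x1 <-  gsub(",\\s*", "\n", gsub("\\s*=\\s*", ":", x))
   con <- textConnection(x1)
   on.exit(close(con))   # close the connection explicitly
   type.convert(as.data.frame(read.dcf(con)),
       as.is = TRUE)[[categ]]
}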

-testing

> f2("X-squared = 0.17857, df = 1, p-value = 0.6726", "X-squared")
[1] 0.17857
> f2("X-squared = 0.017361, df = 1, p-value = 0.8952")
[1] 0.8952
> f2("X-squared = 0.017361, df = 1, p-value = 0.8952", "X-squared")
[1] 0.017361
>  f2("X-squared = 2.5679e-32, df = 1, p-value = 1")
[1] 1
> f2("X-squared = 2.5679e-32, df = 1, p-value = 1", "X-squared")
[1] 2.5679e-32
> f2("X-squared = 0.17857, df = 1, p-value = 0.6726")
[1] 0.6726
> f2("X-squared = 0.17857, df = 1, p-value = 0.6726", "X-squared")
[1] 0.17857
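
If all of the values on the line are wanted at once, a similar key/value split can be sketched with strsplit alone. The name parse_stat_line is hypothetical, and the sketch assumes every field has the form "name = number" separated by commas, as in the lines above.

```r
# Hypothetical helper: split "name = value" pairs into a named numeric vector
parse_stat_line <- function(x) {
  parts <- strsplit(x, ",\\s*")[[1]]       # one "name = value" string per field
  kv <- strsplit(parts, "\\s*=\\s*")       # each element is c(name, value)
  vals <- as.numeric(vapply(kv, `[`, character(1), 2))
  names(vals) <- vapply(kv, `[`, character(1), 1)
  vals
}

parsed <- parse_stat_line("X-squared = 2.5679e-32, df = 1, p-value = 1")
parsed[["X-squared"]]
parsed[["p-value"]]
```

Because the result is a plain named vector, the names keep their hyphens and have to be accessed with [[ ]] rather than $.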

It is not clear why the list returned by chisq.test needs to be converted to strings before extraction. It is easier to extract the values directly from the chisq.test output with $ or [[

M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender = c("F", "M"),
                    party = c("Democrat","Independent", "Republican"))
Xsq <- chisq.test(M)
Xsq$p.value
#[1] 2.953589e-07
Xsq$statistic[["X-squared"]]
#[1] 30.07015

CodePudding user response:

While not what you've asked, it looks as though you used capture.output(.) to capture those strings. Instead of trying to extract the strings from the captured output, I suggest you get the real numbers from the objects themselves.

M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender = c("F", "M"),
                    party = c("Democrat","Independent", "Republican"))
Xsq <- chisq.test(M)
names(Xsq)
# [1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed"  "expected"  "residuals" "stdres"   
Xsq[c("statistic","p.value")]
# $statistic
# X-squared 
#  30.07015 
# $p.value
# [1] 2.953589e-07

Since you mention having a list of these, it's easy to work with that as well. For instance, if you have a list of test results as in

Xsq2 <- lapply(list(M, M), chisq.test)
Xsq2
# [[1]]
#   Pearson's Chi-squared test
# data:  X[[i]]
# X-squared = 30.07, df = 2, p-value = 2.954e-07
# [[2]]
#   Pearson's Chi-squared test
# data:  X[[i]]
# X-squared = 30.07, df = 2, p-value = 2.954e-07
lapply(Xsq2, `[`, c("statistic", "p.value"))
# [[1]]
# [[1]]$statistic
# X-squared 
#  30.07015 
# [[1]]$p.value
# [1] 2.953589e-07
# [[2]]
# [[2]]$statistic
# X-squared 
#  30.07015 
# [[2]]$p.value
# [1] 2.953589e-07

which can be easily converted into a data.frame with:

do.call(rbind.data.frame, lapply(Xsq2, `[`, c("statistic", "p.value")))
#   statistic      p.value
# 1  30.07015 2.953589e-07
# 2  30.07015 2.953589e-07
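
Applied to the situation in the question, this might look roughly like the sketch below. The data frame here is made up purely for illustration (the real data were not posted), and the column names T1, T2, T3 are guessed from the "data:" lines in the captured output.

```r
# Made-up stand-in for the asker's `data`; only the shape matters here
data <- data.frame(
  parasite = rep(c("yes", "no"), each = 30),
  T1 = rep(c("a", "b"), times = 30),
  T2 = rep(c("a", "a", "b"), times = 20),
  T3 = rep(c("a", "b", "b", "b"), times = 15)
)

targets <- c("T1", "T2", "T3")

# Run one test per column and keep the htest objects themselves
tests <- lapply(targets, function(v) chisq.test(table(data$parasite, data[[v]])))

# Pull the numbers straight from each object; no string parsing involved
res <- data.frame(
  variable  = targets,
  x_squared = vapply(tests, function(t) unname(t$statistic), numeric(1)),
  df        = vapply(tests, function(t) unname(t$parameter), numeric(1)),
  p_value   = vapply(tests, function(t) t$p.value, numeric(1))
)
res
```

The values keep full precision this way, rather than the rounded digits that appear in the printed summary.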