Home > Software engineering >  How do I extract estimates and p-value from cor.test results across multiple groups?
How do I extract estimates and p-value from cor.test results across multiple groups?

Time:11-25

I have a set of data that has multiple groups across multiple time points (sample cut below):

enter image description here

I've been trying to conduct multiple cor.test between X and Y between each status, and across each gender.

I was having a hard time figuring out the group comparison so I filtered by status and split my gender cor.tests separately for Status = Red and Status = Blue (using filter).

This is my current code which runs cor.test across each gender:

    red_status <- all %>% filter(status == "Red")
    cor_red <- by(red_status, red_status$gender, 
                  FUN = function(df) cor.test(df$X, df$Y, method = "spearman"))

The output result shows the 3 different cor.test across each gender:


red_status$gengrp: M
    Spearman's rank correlation rho

data:  df$X and df$Y
S = 123.45, p-value = 0.123
alternative hypothesis: true rho is not equal to 0
sample estimates:
     rho 
0.123456 

----------------------------------

red_status$gengrp: F
... (same output style as gengrp: M ^)

----------------------------------
red_status$gengrp: O
... (same output style as gengrp: M ^)

What I'm trying to do is extract the estimate and p-values across all gender cor.test and place them in a dataframe.

I thought I can use the data.frame() function to extract the gender name and correlation elements, then just add another column for p-values, however doing this gave me an error:

# Where red_status[1] is gender names (M,F,O) and red_status[[3:4]] are the Spearman p-value and rho estimate *within each gender category*
data.frame(group = dimnames(red_status)[1], est = as.vector(red_status)[[3]], 
pval = as.vector(red_status[[4]])
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : 
  cannot coerce class ‘"htest"’ to a data.frame

Since I filtered by Status == Red, I'd have to rerun the codes again to get the result for the gender cor.test results of Status == Blue then at the end bind the estimates and p-values all to 1 df.

My goal is to be able to create a data frame that shows the correlation estimate and p-value across each status and gender:

Status   Gender   Estimate(rho)    P-value
   Red        M            1.23      0.123
   Red        F            0.45      0.054
   Red        O             ...        ...
  Blue        M           0.004      0.123
  Blue        F             ...        ...
  Blue        O             ...        ...

Any help/tips would be appreciated.

CodePudding user response:

Answer based off of @Ruam Pimentel's comment regarding the rstatix package.

Using piping the cor_test() function, I can group by both status and gender to run the correlation test. The cor_test() function is designed to output a dataframe containing all elements of the correlation (i.e. statistic, p-value, estimate) all depending on the correlation method of choice.

This is the code that worked:

 r <- dfall %>%
                group_by(status, gender) %>%
                cor_test(X, Y, method = "spearman")

Result (edited the numbers):

  status  gengrp var1   var2   cor  statistic    p   method  
  <chr>   <fct>  <chr>  <chr> <dbl>     <dbl>  <dbl> <chr>   
1 Red       M     X       Y    0.98    -0.123  0.123  Spearman
2 Red       F     X       Y    0.12     0.456  0.456  Spearman
3 Red       O     X       Y    0.34     0.944  0.789  Spearman
4 Blue      M     X       Y    0.56     0.246  0.101  Spearman
5 Blue      F     X       Y    0.78    0.4107  0.111  Spearman
6 Blue      O     X       Y     0.9     0.123  0.122  Spearman

CodePudding user response:

While the rstatix solution will require considerably less code than a Base R solution, I'm posting one to demonstrate that it is possible to create the requested output in Base R.

The key to a Base R solution is understanding how to navigate the output object from cor.test() to extract the requested content, convert it into single row data frames, and rbind() the list of objects into a single data frame.

First, we generate some data.

set.seed(9108171) # for reproducibility
gender <- rep(c("F","M", "O"),40)
status <- c(rep("Red",60),rep("Blue",60))
x <- round(rnorm(120,mean = 6,sd = 2),1)
y <- round(rnorm(120,mean = 8,sd = 1),1)

df <- data.frame(gender,status, x, y)

Next, we filter down to the red items per the code in the original post, and run the cor.test() function.

# filter red
library(dplyr)
red_status <- filter(df,status == "Red")
cor_red <- by(red_status, red_status$gender,
              FUN = function(df) cor.test(df$x, df$y, method = "spearman"))

Finally we use lapply() to extract the data and combine it into a data frame. Note the use of the [[ form of the extract operator, which removes one layer of nesting from the list of lists that was generated by cor.test(). Also note how we use a vector of names, the categories of gender, to drive lapply(). These values were used to distinguish the by groups in the previous call to cor.test().

# extract the required data
theResults <- lapply(c("F","M","O"),function(x){
  aTest <- cor_red[[x]]
  data.frame(status = "Red",
             test = names(aTest$statistic),
             value = aTest$statistic,
             p_value = aTest$p.value,
             rho = aTest$estimate)
})

# rbind the results into a single data frame
do.call(rbind,theResults)

...and the results for the Red status:

> do.call(rbind,theResults)
   status test    value   p_value         rho
S     Red    S 1072.672 0.4137465  0.19347947
S1    Red    S 1396.400 0.8344303 -0.04992459
S2    Red    S 1281.132 0.8777763  0.03674259

We can repeat the procedure for the Blue status and combine the results to obtain the following:

> rbind(blueResults,redResults)
    status test    value    p_value         rho
S     Blue    S 1541.034 0.50402812 -0.15867211
S1    Blue    S 1087.954 0.44253280  0.18198950
S2    Blue    S 1880.742 0.06950608 -0.41409194
S3     Red    S 1072.672 0.41374648  0.19347947
S11    Red    S 1396.400 0.83443026 -0.04992459
S21    Red    S 1281.132 0.87777629  0.03674259
>

The complete script to produce the final table is:

set.seed(9108171) # for reproducibility
gender <- rep(c("F","M", "O"),40)
status <- c(rep("Red",60),rep("Blue",60))
x <- round(rnorm(120,mean = 6,sd = 2),1)
y <- round(rnorm(120,mean = 8,sd = 1),1)

df <- data.frame(gender,status, x, y)

# filter red
library(dplyr)
red_status <- filter(df,status == "Red")
cor_red <- by(red_status, red_status$gender,
              FUN = function(df) cor.test(df$x, df$y, method = "spearman"))

# extract the required data
theResults <- lapply(c("F","M","O"),function(x){
  aTest <- cor_red[[x]]
  data.frame(status = "Red",
             test = names(aTest$statistic),
             value = aTest$statistic,
             p_value = aTest$p.value,
             rho = aTest$estimate)
})

# rbind the results into a single data frame
redResults <- do.call(rbind,theResults)
redResults

blue_status <- filter(df,status == "Blue")
cor_blue<- by(blue_status, blue_status$gender,
              FUN = function(df) cor.test(df$x, df$y, method = "spearman"))

# extract the required data
theResults <- lapply(c("F","M","O"),function(x){
  aTest <- cor_blue[[x]]
  data.frame(status = "Blue",
             test = names(aTest$statistic),
             value = aTest$statistic,
             p_value = aTest$p.value,
             rho = aTest$estimate)
})

# rbind the results into a single data frame
blueResults <- do.call(rbind,theResults)
rbind(blueResults,redResults)

I recognize that I could abstract out the repeated code into a support function to reduce the total lines of code needed for the solution, but that's left as a trivial exercise for the reader.

  • Related