In R, how do you compute multiple mean scores based on partial variable names in a single function /-CodePudding

Hi everyone hope you're well.

I was wondering if anyone has any idea how to create a set of mean scores based on the first few characters in a set of variables. The field is psychology and I am trying to score a personality trait instrument using concise code with no repetition, but I am coming short.

An example data frame and naming convention is below:

df <- tibble(
  psyAnger_01 = rnorm(10),
  psyAnger_02 = rnorm(10),
  psyAnger_03 = rnorm(10),
  psyAnger_04 = rnorm(10),
  narAnger_01 = rnorm(10),
  narAnger_02 = rnorm(10),
  narAnger_03 = rnorm(10),
  narAnger_04 = rnorm(10),
  psyArrog_01 = rnorm(10),
  psyArrog_02 = rnorm(10),
  psyArrog_03 = rnorm(10),
  psyArrog_04 = rnorm(10),
  )

In the real data frame I have dozens of variables (and multiple data frames) so I am trying to calculate means based on a partial string of the column name i.e., psyAnger. I can do this with a pmap() as below:

df <- df %>% mutate(psyAnger = pmap_dbl(
  select(., starts_with("psyAnger")),
  ~ mean(c(...))))

This works perfectly and produces a variable psyAnger with a mean of the other 4. Unfortunately I am now struggling to extend this out to the full data frame without copy and pasting and changing the variable names. For example:

df <- df %>% mutate(narAnger = pmap_dbl(
  select(., starts_with("narAnger")),
  ~ mean(c(...))))

I had the idea of trying to feed in a vector of scale names into the loop e.g., something like the below:

columns <- c("psyAnger", "narAnger", "psyArrog")

but I've got no clue idea how to integrate the two functions. Desired output would be all the variables having mean scores for each participant as the last columns in the dataset. This is my first time creating an MRE so please let me know if anything needs amending.'

CodePudding user response：

I maintain a package on github which helps with this kind of problems. We can use dplyover::over and extract the name stems with cut_names(). Then use across() to get all variables that start with each stem and calculate the rowMeans.

library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover

# when generating random data please set a seed for reproducability
set.seed(123)

# data
df <- tibble(
  psyAnger_01 = rnorm(10),
  psyAnger_02 = rnorm(10),
  psyAnger_03 = rnorm(10),
  psyAnger_04 = rnorm(10),
  narAnger_01 = rnorm(10),
  narAnger_02 = rnorm(10),
  narAnger_03 = rnorm(10),
  narAnger_04 = rnorm(10),
  psyArrog_01 = rnorm(10),
  psyArrog_02 = rnorm(10),
  psyArrog_03 = rnorm(10),
  psyArrog_04 = rnorm(10),
)

df %>% 
  transmute(over(cut_names("_\\d "),
              ~ rowMeans(across(starts_with(.x)))))

#> # A tibble: 10 x 3
#>    psyAnger narAnger psyArrog
#>       <dbl>    <dbl>    <dbl>
#>  1  0.00556   -0.138 -0.0716 
#>  2 -0.0959    -0.762  0.450  
#>  3  0.457     -0.159 -0.499  
#>  4  0.0826     0.452 -0.0967 
#>  5 -0.0575    -0.194  0.177  
#>  6  0.626      0.431 -0.00309
#>  7  0.588     -0.447  0.651  
#>  8 -0.785     -0.262 -0.0852 
#>  9 -0.357      0.502 -0.448  
#> 10 -0.0113     0.511  0.00431

^{Created on 2022-08-11 by the reprex package (v2.0.1)}

We can use a workaround to create a similar approach with purrr::map:

library(dplyr)
library(purrr)

var_nms <- gsub("_\\d ", "", names(df)) %>% unique()

df %>% 
  transmute(map_dfc(set_names(var_nms),
                     ~ rowMeans(across(starts_with(.x)))))
#> # A tibble: 10 x 3
#>    psyAnger narAnger psyArrog
#>       <dbl>    <dbl>    <dbl>
#>  1  0.00556   -0.138 -0.0716 
#>  2 -0.0959    -0.762  0.450  
#>  3  0.457     -0.159 -0.499  
#>  4  0.0826     0.452 -0.0967 
#>  5 -0.0575    -0.194  0.177  
#>  6  0.626      0.431 -0.00309
#>  7  0.588     -0.447  0.651  
#>  8 -0.785     -0.262 -0.0852 
#>  9 -0.357      0.502 -0.448  
#> 10 -0.0113     0.511  0.00431

^{Created on 2022-08-11 by the reprex package (v2.0.1)}

CodePudding user response：

OK. As I mentioned in one of my comments, your life will be much simpler if you invest a little time in making your data tidy. Here this means extracting information from the column names and putting it into the data frame.

Here's a quick and dirty way of doing that:

library(tidyverse)

dfLong <- df %>% 
  pivot_longer(
    everything(),
    names_to=c("Prefix", "Emotion", "Test"),
    names_sep=c(3, 8),
    values_to="Score"
  )
# A tibble: 120 × 4
   Prefix Emotion Test     Score
   <chr>  <chr>   <chr>    <dbl>
 1 psy    Anger   _01   -0.560  
 2 psy    Anger   _02    1.22   
 3 psy    Anger   _03   -1.07   
 4 psy    Anger   _04    0.426  
 5 nar    Anger   _01   -0.695  
 6 nar    Anger   _02    0.253  
 7 nar    Anger   _03    0.380  
 8 nar    Anger   _04   -0.491  
 9 psy    Arrog   _01    0.00576
10 psy    Arrog   _02    0.994  
# … with 110 more rows

There are very sophisticated options for defining how to pivot the data. You can use regular expressions in names_pattern, supply a separator character in names_sep and so on. Here, I've specified a vector of character positions. One obvious thing that could be tidied up is the leading underscore in Test. If I've misinterpreted the meaning of some aspects of the data, you can just rename the columns of dfLong to something more appropriate.

Now that's done, getting your summaries is straightforward.

dfLong %>% 
  group_by(Prefix, Emotion) %>% 
  summarise(
    N=n(),
    Mean=mean(Score),
    .groups="drop"
  )
# A tibble: 3 × 4
  Prefix Emotion     N     Mean
  <chr>  <chr>   <int>    <dbl>
1 nar    Anger      40 -0.00672
2 psy    Anger      40  0.0452 
3 psy    Arrog      40  0.00786

This code is robust in the sense that it will work without alteration regardless of the number of tests, the names of the emotions being tested and the different prefixes being used, it will work without alteration.

It occurs to me that you said you want "a new variable that is a mean of the 4 columns", as if you have repeated measurements on a series of ten subjects. That's also easy. We just have to intruduce a new variable denoting the subject before converting to long format.

df %>% 
  mutate(Subject=row_number()) %>% 
  pivot_longer(
    -Subject,
    names_to=c("Prefix", "Emotion", "Test"),
    names_sep=c(3, 8),
    values_to="Score"
  ) %>% 
  group_by(Subject, Prefix, Emotion) %>% 
  summarise(
    N=n(),
    Mean=mean(Score),
    .groups="drop"
  )
# A tibble: 30 × 5
   Subject Prefix Emotion     N     Mean
     <int> <chr>  <chr>   <int>    <dbl>
 1       1 nar    Anger       4 -0.138  
 2       1 psy    Anger       4  0.00556
 3       1 psy    Arrog       4 -0.0716 
 4       2 nar    Anger       4 -0.762  
 5       2 psy    Anger       4 -0.0959 
 6       2 psy    Arrog       4  0.450  
 7       3 nar    Anger       4 -0.159  
 8       3 psy    Anger       4  0.457  
 9       3 psy    Arrog       4 -0.499  
10       4 nar    Anger       4  0.452  
# … with 20 more rows

My comments about robustness apply equally to this variation, with the addition that it's also robust with respect to the number of subjects.

For reproducibility, I created the test df with the following code:

# For reprducibility
set.seed(123)

df <- tibble(
  psyAnger_01 = rnorm(10),
  psyAnger_02 = rnorm(10),
  psyAnger_03 = rnorm(10),
  psyAnger_04 = rnorm(10),
  narAnger_01 = rnorm(10),
  narAnger_02 = rnorm(10),
  narAnger_03 = rnorm(10),
  narAnger_04 = rnorm(10),
  psyArrog_01 = rnorm(10),
  psyArrog_02 = rnorm(10),
  psyArrog_03 = rnorm(10),
  psyArrog_04 = rnorm(10),
)