Hi everyone, hope you're well.
I was wondering if anyone has an idea how to create a set of mean scores based on the first few characters of a set of variable names. The field is psychology, and I am trying to score a personality trait instrument using concise code with no repetition, but I am coming up short.
An example data frame and naming convention is below:
df <- tibble(
psyAnger_01 = rnorm(10),
psyAnger_02 = rnorm(10),
psyAnger_03 = rnorm(10),
psyAnger_04 = rnorm(10),
narAnger_01 = rnorm(10),
narAnger_02 = rnorm(10),
narAnger_03 = rnorm(10),
narAnger_04 = rnorm(10),
psyArrog_01 = rnorm(10),
psyArrog_02 = rnorm(10),
psyArrog_03 = rnorm(10),
psyArrog_04 = rnorm(10),
)
In the real data frame I have dozens of variables (and multiple data frames), so I am trying to calculate means based on a partial string of the column name, e.g. psyAnger. I can do this with pmap() as below:
df <- df %>% mutate(psyAnger = pmap_dbl(
select(., starts_with("psyAnger")),
~ mean(c(...))))
This works perfectly and produces a variable psyAnger holding the mean of the other 4. Unfortunately, I am now struggling to extend this to the full data frame without copying, pasting and changing the variable names each time. For example:
df <- df %>% mutate(narAnger = pmap_dbl(
select(., starts_with("narAnger")),
~ mean(c(...))))
I had the idea of feeding a vector of scale names into the loop, e.g. something like the below:
columns <- c("psyAnger", "narAnger", "psyArrog")
but I've got no idea how to integrate the two functions. The desired output would have the mean scores for each participant as the last columns in the dataset. This is my first time creating an MRE, so please let me know if anything needs amending.
CodePudding user response:
I maintain a package on GitHub which helps with this kind of problem. We can use dplyover::over() and extract the name stems with cut_names(). Then we use across() to get all variables that start with each stem and calculate the rowMeans().
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
# when generating random data please set a seed for reproducibility
set.seed(123)
# data
df <- tibble(
psyAnger_01 = rnorm(10),
psyAnger_02 = rnorm(10),
psyAnger_03 = rnorm(10),
psyAnger_04 = rnorm(10),
narAnger_01 = rnorm(10),
narAnger_02 = rnorm(10),
narAnger_03 = rnorm(10),
narAnger_04 = rnorm(10),
psyArrog_01 = rnorm(10),
psyArrog_02 = rnorm(10),
psyArrog_03 = rnorm(10),
psyArrog_04 = rnorm(10),
)
df %>%
transmute(over(cut_names("_\\d+"),
~ rowMeans(across(starts_with(.x)))))
#> # A tibble: 10 x 3
#> psyAnger narAnger psyArrog
#> <dbl> <dbl> <dbl>
#> 1 0.00556 -0.138 -0.0716
#> 2 -0.0959 -0.762 0.450
#> 3 0.457 -0.159 -0.499
#> 4 0.0826 0.452 -0.0967
#> 5 -0.0575 -0.194 0.177
#> 6 0.626 0.431 -0.00309
#> 7 0.588 -0.447 0.651
#> 8 -0.785 -0.262 -0.0852
#> 9 -0.357 0.502 -0.448
#> 10 -0.0113 0.511 0.00431
Created on 2022-08-11 by the reprex package (v2.0.1)
We can use a workaround to create a similar approach with purrr::map_dfc():
library(dplyr)
library(purrr)
var_nms <- gsub("_\\d+", "", names(df)) %>% unique()
df %>%
transmute(map_dfc(set_names(var_nms),
~ rowMeans(across(starts_with(.x)))))
#> # A tibble: 10 x 3
#> psyAnger narAnger psyArrog
#> <dbl> <dbl> <dbl>
#> 1 0.00556 -0.138 -0.0716
#> 2 -0.0959 -0.762 0.450
#> 3 0.457 -0.159 -0.499
#> 4 0.0826 0.452 -0.0967
#> 5 -0.0575 -0.194 0.177
#> 6 0.626 0.431 -0.00309
#> 7 0.588 -0.447 0.651
#> 8 -0.785 -0.262 -0.0852
#> 9 -0.357 0.502 -0.448
#> 10 -0.0113 0.511 0.00431
Created on 2022-08-11 by the reprex package (v2.0.1)
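Both snippets above use transmute(), which keeps only the new mean columns. If the means should instead sit at the end of the original data frame, as the question asks, one possible sketch is the following. It assumes the scale stems are supplied by hand in a vector (columns, as in the question) and uses a smaller stand-in df; scale_means and df_scored are names made up for this illustration.

```r
library(dplyr)
library(tibble)

set.seed(123)

# Smaller stand-in for the question's data (two items per scale)
df <- tibble(
  psyAnger_01 = rnorm(10),
  psyAnger_02 = rnorm(10),
  narAnger_01 = rnorm(10),
  narAnger_02 = rnorm(10)
)

# Scale stems supplied by hand, as suggested in the question
columns <- c("psyAnger", "narAnger")

# One vector of row means per stem, collected in a named list
scale_means <- lapply(
  setNames(columns, columns),
  function(stem) rowMeans(select(df, starts_with(stem)))
)

# Append the means as the last columns (hypothetical name: df_scored)
df_scored <- bind_cols(df, as_tibble(scale_means))
```

Because starts_with() matches on a prefix, the means here are computed from the original df rather than from df_scored, so a previously added psyAnger column is never averaged in by accident.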
CodePudding user response:
OK. As I mentioned in one of my comments, your life will be much simpler if you invest a little time in making your data tidy. Here this means extracting information from the column names and putting it into the data frame.
Here's a quick and dirty way of doing that:
library(tidyverse)
dfLong <- df %>%
pivot_longer(
everything(),
names_to=c("Prefix", "Emotion", "Test"),
names_sep=c(3, 8),
values_to="Score"
)
# A tibble: 120 × 4
Prefix Emotion Test Score
<chr> <chr> <chr> <dbl>
1 psy Anger _01 -0.560
2 psy Anger _02 1.22
3 psy Anger _03 -1.07
4 psy Anger _04 0.426
5 nar Anger _01 -0.695
6 nar Anger _02 0.253
7 nar Anger _03 0.380
8 nar Anger _04 -0.491
9 psy Arrog _01 0.00576
10 psy Arrog _02 0.994
# … with 110 more rows
There are very sophisticated options for defining how to pivot the data. You can use regular expressions in names_pattern, supply a separator character in names_sep and so on. Here, I've specified a vector of character positions. One obvious thing that could be tidied up is the leading underscore in Test. If I've misinterpreted the meaning of some aspects of the data, you can just rename the columns of dfLong to something more appropriate.
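As an illustration of the names_pattern option mentioned above, here is a rough sketch that splits the names with a regular expression instead of fixed character positions, which also drops the underscore from Test. The pattern is my assumption about the naming convention (lower-case prefix, capitalised emotion, underscore, digits) and would need adjusting if the real names differ; the small df is a stand-in.

```r
library(dplyr)
library(tidyr)

set.seed(123)

# Small stand-in data following the question's naming convention
df <- tibble(
  psyAnger_01 = rnorm(3),
  narAnger_01 = rnorm(3),
  psyArrog_02 = rnorm(3)
)

dfLong <- df %>%
  pivot_longer(
    everything(),
    names_to = c("Prefix", "Emotion", "Test"),
    # Assumed convention: lower-case prefix, capitalised emotion, "_", digits
    names_pattern = "^([a-z]+)([A-Z][a-z]*)_(\\d+)$",
    values_to = "Score"
  )
```

Unlike a vector of positions, the pattern does not depend on every prefix having three characters or every emotion name having five.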
Now that's done, getting your summaries is straightforward.
dfLong %>%
group_by(Prefix, Emotion) %>%
summarise(
N=n(),
Mean=mean(Score),
.groups="drop"
)
# A tibble: 3 × 4
Prefix Emotion N Mean
<chr> <chr> <int> <dbl>
1 nar Anger 40 -0.00672
2 psy Anger 40 0.0452
3 psy Arrog 40 0.00786
This code is robust in the sense that it will work without alteration regardless of the number of tests, the names of the emotions being tested and the different prefixes being used, provided the prefixes and emotion names keep the same lengths, since names_sep splits at fixed character positions here.
It occurs to me that you said you want "a new variable that is a mean of the 4 columns", as if you have repeated measurements on a series of ten subjects. That's also easy. We just have to introduce a new variable denoting the subject before converting to long format.
df %>%
mutate(Subject=row_number()) %>%
pivot_longer(
-Subject,
names_to=c("Prefix", "Emotion", "Test"),
names_sep=c(3, 8),
values_to="Score"
) %>%
group_by(Subject, Prefix, Emotion) %>%
summarise(
N=n(),
Mean=mean(Score),
.groups="drop"
)
# A tibble: 30 × 5
Subject Prefix Emotion N Mean
<int> <chr> <chr> <int> <dbl>
1 1 nar Anger 4 -0.138
2 1 psy Anger 4 0.00556
3 1 psy Arrog 4 -0.0716
4 2 nar Anger 4 -0.762
5 2 psy Anger 4 -0.0959
6 2 psy Arrog 4 0.450
7 3 nar Anger 4 -0.159
8 3 psy Anger 4 0.457
9 3 psy Arrog 4 -0.499
10 4 nar Anger 4 0.452
# … with 20 more rows
My comments about robustness apply equally to this variation, with the addition that it's also robust with respect to the number of subjects.
For reproducibility, I created the test df with the following code:
# For reproducibility
set.seed(123)
df <- tibble(
psyAnger_01 = rnorm(10),
psyAnger_02 = rnorm(10),
psyAnger_03 = rnorm(10),
psyAnger_04 = rnorm(10),
narAnger_01 = rnorm(10),
narAnger_02 = rnorm(10),
narAnger_03 = rnorm(10),
narAnger_04 = rnorm(10),
psyArrog_01 = rnorm(10),
psyArrog_02 = rnorm(10),
psyArrog_03 = rnorm(10),
psyArrog_04 = rnorm(10),
)
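If the per-subject means are wanted as the last columns of the original wide data, as the question describes, the long summary can be pivoted back and joined on. This is only a sketch under my own assumptions: it splits the names at the underscore into Scale and Test (rather than the three-way split used above), uses a smaller stand-in df, and df_id, means_wide and df_scored are names invented for the example.

```r
library(dplyr)
library(tidyr)

set.seed(123)

# Smaller stand-in data (two items per scale, five subjects)
df <- tibble(
  psyAnger_01 = rnorm(5),
  psyAnger_02 = rnorm(5),
  narAnger_01 = rnorm(5),
  narAnger_02 = rnorm(5)
)

# Add a subject identifier before reshaping
df_id <- df %>% mutate(Subject = row_number())

means_wide <- df_id %>%
  pivot_longer(
    -Subject,
    names_to = c("Scale", "Test"),
    names_sep = "_",          # assumes exactly one underscore per name
    values_to = "Score"
  ) %>%
  group_by(Subject, Scale) %>%
  summarise(Mean = mean(Score), .groups = "drop") %>%
  # One column of means per scale, one row per subject
  pivot_wider(names_from = Scale, values_from = Mean)

# Original columns first, scale means last
df_scored <- left_join(df_id, means_wide, by = "Subject")
```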