I have a dataframe with data from an experiment. Currently, for each permutation of participant_id
, experiment
, side
there is a value estimated by an algorithm. For some permutations however there is also an additional value which has been calculated manually.
The desired output is to select only one row per permutation of participant_id
, experiment
and side
, based on the information given in value_replacement_info
.
In other words, for some permutations there are currently two rows, with a value for each of "algorithm" and "manual" and I need to conditionally select which of the two rows to keep based on the string in value_replacement_info
.
Specifically for permutations where there are rows for both "algorithm" and "manual" values, if the associated value_replacement_info
is to "retain_algorithm_score" we should delete the row where value_type == "manual". Similarly, if value_replacement_info
== "replace_with_manual_score", we should delete the row where value_type == "algorithm".
How would I go about this?
Example data and desired output:
library(dplyr)
# Example data
df1 <- tibble::tribble(
~participant_id, ~experiment, ~side, ~value, ~value_type, ~value_replacement_info,
"p1", "a", "left", 1.466667, "algorithm", NA,
"p1", "a", "right", -119.19438, "algorithm", NA,
"p2", "b", "right", NA, "manual", "retain_algorithm_score",
"p2", "b", "right", 5.733333, "algorithm", NA,
"p2", "b", "left", -310.29948, "algorithm", NA,
"p3", "a", "left", 1.533333, "algorithm", NA,
"p3", "a", "left", -83.3325, "manual", "replace_with_manual_score",
"p3", "a", "right", -26.17686, "algorithm", NA,
"p4", "b", "right", 5.9, "algorithm", NA,
"p4", "b", "right", NA, "manual", "retain_algorithm_score"
)
# Desired output
desired_output <- tibble::tribble(
~participant_id, ~experiment, ~side, ~value, ~value_type, ~value_replacement_info,
"p1", "a", "left", 1.466667, "algorithm", NA,
"p1", "a", "right", -119.19438, "algorithm", NA,
"p2", "b", "right", 5.733333, "algorithm", NA,
"p2", "b", "left", -310.29948, "algorithm", NA,
"p3", "a", "left", -83.3325, "manual", "replace_with_manual_score",
"p3", "a", "right", -26.17686, "algorithm", NA,
"p4", "b", "right", 5.9, "algorithm", NA
)
My thinking so far has been to create a unique identifier for each permutation, thus:
library(tidyr)
df2 <- df1%>%
# create unique ID permutations
unite("permutation_id", participant_id:side, remove = FALSE)
and then conditionally filtering within each permutation (after performing a group_by using the permutation_id). However, I can't get my head around how to filter value_type
conditionally, based on the values in value_replacement_info
.
Thanks in advance for any pointers!
CodePudding user response:
You could group and then use case_when
to conditionally identify which value type to use:
df1 %>%
group_by(participant_id, experiment, side) %>%
mutate(row_to_keep = case_when(
any(value_replacement_info == "retain_algorithm_score") ~ "algorithm",
any(value_replacement_info == "replace_with_manual_score") ~ "manual",
is.na(value_replacement_info) ~ value_type
)) %>%
slice(which(value_type == row_to_keep)) %>%
select(-row_to_keep)
CodePudding user response:
One option is to replace the phrases in value_replacement_info
with just algorithm
or manual
, then fill that in for each group. Then, we can filter
to keep rows with NA
(that we did not need to filter on), and those that have algorithm
or manual
in both columns (value_type
and value_replacement_info
).
library(tidyverse)
df1 %>%
group_by(participant_id, experiment, side) %>%
mutate(value_replacement_info2 = str_extract(value_replacement_info, "algorithm|manual")) %>%
fill(value_replacement_info2, .direction = "updown") %>%
filter(is.na(value_replacement_info2) |
value_replacement_info2 == value_type) %>%
select(-last_col())
*Note: Here I created a duplicate column value_replacement_info2
; however, if you don't need to retain that information then you could just use value_replacement_info
instead.
Output
participant_id experiment side value value_type value_replacement_info
<chr> <chr> <chr> <dbl> <chr> <chr>
1 p1 a left 1.47 algorithm NA
2 p1 a right -119. algorithm NA
3 p2 b right 5.73 algorithm NA
4 p2 b left -310. algorithm NA
5 p3 a left -83.3 manual replace_with_manual_score
6 p3 a right -26.2 algorithm NA
7 p4 b right 5.9 algorithm NA