How to conditionally filter using values from multiple columns inside grouped data?-CodePudding

I have a dataframe with data from an experiment. Currently, for each permutation of participant_id, experiment, side there is a value estimated by an algorithm. For some permutations however there is also an additional value which has been calculated manually.

The desired output is to select only one row per permutation of participant_id, experiment and side, based on the information given in value_replacement_info.

In other words, for some permutations there are currently two rows, with a value for each of "algorithm" and "manual" and I need to conditionally select which of the two rows to keep based on the string in value_replacement_info.

Specifically for permutations where there are rows for both "algorithm" and "manual" values, if the associated value_replacement_info is to "retain_algorithm_score" we should delete the row where value_type == "manual". Similarly, if value_replacement_info == "replace_with_manual_score", we should delete the row where value_type == "algorithm".

How would I go about this?

Example data and desired output:

library(dplyr)

# Example data
df1 <- tibble::tribble(
  ~participant_id, ~experiment,   ~side,     ~value, ~value_type,     ~value_replacement_info,
             "p1",         "a",  "left",   1.466667, "algorithm",                          NA,
             "p1",         "a", "right", -119.19438, "algorithm",                          NA,
             "p2",         "b", "right",         NA,    "manual",    "retain_algorithm_score",
             "p2",         "b", "right",   5.733333, "algorithm",                          NA,
             "p2",         "b",  "left", -310.29948, "algorithm",                          NA,
             "p3",         "a",  "left",   1.533333, "algorithm",                          NA,
             "p3",         "a",  "left",   -83.3325,    "manual", "replace_with_manual_score",
             "p3",         "a", "right",  -26.17686, "algorithm",                          NA,
             "p4",         "b", "right",        5.9, "algorithm",                          NA,
             "p4",         "b", "right",         NA,    "manual",    "retain_algorithm_score"
  )

# Desired output
desired_output <- tibble::tribble(
  ~participant_id, ~experiment,   ~side,     ~value, ~value_type,     ~value_replacement_info,
             "p1",         "a",  "left",   1.466667, "algorithm",                          NA,
             "p1",         "a", "right", -119.19438, "algorithm",                          NA,
             "p2",         "b", "right",   5.733333, "algorithm",                          NA,
             "p2",         "b",  "left", -310.29948, "algorithm",                          NA,
             "p3",         "a",  "left",   -83.3325,    "manual", "replace_with_manual_score",
             "p3",         "a", "right",  -26.17686, "algorithm",                          NA,
             "p4",         "b", "right",        5.9, "algorithm",                          NA
  )

My thinking so far has been to create a unique identifier for each permutation, thus:

library(tidyr)

df2 <- df1%>% 
  # create unique ID permutations
  unite("permutation_id", participant_id:side, remove = FALSE)

and then conditionally filtering within each permutation (after performing a group_by using the permutation_id). However, I can't get my head around how to filter value_type conditionally, based on the values in value_replacement_info.

Thanks in advance for any pointers!

CodePudding user response：

You could group and then use case_when to conditionally identify which value type to use:

df1 %>%
  group_by(participant_id, experiment, side) %>%
  mutate(row_to_keep = case_when(
    any(value_replacement_info == "retain_algorithm_score") ~ "algorithm",
    any(value_replacement_info == "replace_with_manual_score") ~ "manual",
    is.na(value_replacement_info) ~ value_type
  )) %>%
  slice(which(value_type == row_to_keep)) %>%
  select(-row_to_keep)

CodePudding user response：

One option is to replace the phrases in value_replacement_info with just algorithm or manual, then fill that in for each group. Then, we can filter to keep rows with NA (that we did not need to filter on), and those that have algorithm or manual in both columns (value_type and value_replacement_info).

library(tidyverse)

df1 %>%
  group_by(participant_id, experiment, side) %>%
  mutate(value_replacement_info2 = str_extract(value_replacement_info, "algorithm|manual")) %>%
  fill(value_replacement_info2, .direction = "updown") %>%
  filter(is.na(value_replacement_info2) |
           value_replacement_info2 == value_type) %>% 
  select(-last_col())

*Note: Here I created a duplicate column value_replacement_info2; however, if you don't need to retain that information then you could just use value_replacement_info instead.

Output

  participant_id experiment side    value value_type value_replacement_info   
  <chr>          <chr>      <chr>   <dbl> <chr>      <chr>                    
1 p1             a          left     1.47 algorithm  NA                       
2 p1             a          right -119.   algorithm  NA                       
3 p2             b          right    5.73 algorithm  NA                       
4 p2             b          left  -310.   algorithm  NA                       
5 p3             a          left   -83.3  manual     replace_with_manual_score
6 p3             a          right  -26.2  algorithm  NA                       
7 p4             b          right    5.9  algorithm  NA