Home > front end >  Create column based on distinctive sets of multiple row values
Create column based on distinctive sets of multiple row values

Time:12-28

I would like to identify unique sets of row values in a column based on row values in another column to ultimately create a new column in a dataframe. The following picture illustrates my problem and the expected result (expected_outcome column).

For example:

  • The first 3 rows have values T1 in the column trial, and values D1, D2, D3 in the column group.

  • The next 3 rows have values T3 in the column trial, and values D3, D2, D1 in the column group.

Because the set D1, D2, D3 has the same contain as D3, D2, D1, I want all the 6 rows to have the same value in column expected_outcome.

My data is way more complex than that. I may have to make this grouping over more than 2 columns. So, I prefer a generic solution to this problem. Below is the data in the picture.

test_data <- data.frame(
  trial            = c("T1", "T1", "T1", "T3", "T3", "T3", "T5", "T5", "T6", "T6", "T6"),
  group            = c("D1", "D2", "D3", "D3", "D2", "D1", "D1", "D3", "D1", "D3", "D2")
)

enter image description here

CodePudding user response:

You could do something like this via tidyverse.

library(tidyverse)

test_data %>%
  group_by(trial) %>%
  summarize(type = paste(sort(unique(group)), collapse = ", "), group) %>%
  group_by(type) %>%
  mutate(expected_outcome = cur_group_id()) %>%
  ungroup() %>%
  dplyr::select(-"type")

Output

# A tibble: 11 × 3
   trial group expected_outcome
   <chr> <chr>            <int>
 1 T1    D1                   1
 2 T1    D2                   1
 3 T1    D3                   1
 4 T3    D3                   1
 5 T3    D2                   1
 6 T3    D1                   1
 7 T5    D1                   2
 8 T5    D3                   2
 9 T6    D1                   1
10 T6    D3                   1
11 T6    D2                   1

CodePudding user response:

I think the last trial in your column should be T7?

library(dplyr)

test_data %>% 
  arrange(across(everything())) %>% 
  group_by(trial) %>% 
  mutate(expected_outcome = toString(group)) %>%
  group_by(expected_outcome) %>% 
  mutate(expected_outcome = cur_group_id())
    trial group expected_outcome
   <chr> <chr>            <int>
 1 T1    D1                   1
 2 T1    D2                   1
 3 T1    D3                   1
 4 T3    D1                   1
 5 T3    D2                   1
 6 T3    D3                   1
 7 T5    D1                   2
 8 T5    D3                   2
 9 T6    D1                   2
10 T6    D3                   2
11 T7    D2                   3

data:

test_data <- structure(list(trial = c("T1", "T1", "T1", "T3", "T3", "T3", 
"T5", "T5", "T6", "T6", "T7"), group = c("D1", "D2", "D3", "D3", 
"D2", "D1", "D1", "D3", "D1", "D3", "D2")), class = "data.frame", row.names = c(NA, 
-11L))

CodePudding user response:

factorize group column and, using ave calculate cumsum of duplicateds in each group. Then, where even numbers appear, cumsum the differences that equal 1.

(test_data <- within(test_data, { 
                    group_int <- as.integer(as.factor(group))
                    group_2 <- ave(group_int, group_int, FUN=\(x) 
                                   cumsum(duplicated(x)))
                    expected_outcome <- cumsum(c(1, diff(group_2 %% 2 == 0)) == 1)
                    rm(group_int, group_2)
}))
#    trial group expected_outcome
# 1     T1    D1                1
# 2     T1    D2                1
# 3     T1    D3                1
# 4     T3    D3                1
# 5     T3    D2                1
# 6     T3    D1                1
# 7     T5    D1                2
# 8     T5    D3                2
# 9     T6    D1                2
# 10    T6    D3                2
# 11    T6    D2                3

Note: R >= 4.1


Data:

test_data <- structure(list(trial = c("T1", "T1", "T1", "T3", "T3", "T3", 
"T5", "T5", "T6", "T6", "T6"), group = c("D1", "D2", "D3", "D3", 
"D2", "D1", "D1", "D3", "D1", "D3", "D2"), expected_outcome = c(1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L)), row.names = c(NA, -11L
), class = "data.frame")
  • Related