Home > Software design >  Find cohorts in dataset in r dataframe
Find cohorts in dataset in r dataframe

Time:09-21

I'm trying to find the biggest cohort in a dataset of about 1000 candidates and 100 test questions. Every candidate is asked 15 questions out of a pool of 100 test questions. People in different cohorts make the same set of randomly sampled questions. I'm trying to find the largest group of candidates who all make the same test.

I'm working in R. The data.frame has about a 1000 rows, and 100 columns. Each column indicates which test question we're working with. For each row (candidate) all column entries are NA apart from the ones where a candidate filled in a particular question he or she was shown. The input in these question instances are either 0 or 1. (see picture)

Is there an elegant way to solve this? The only thing I could think of was using dplyer and filter per 15 question subset, and check how many rows still remain. However, with 100 columns this means it has to check (i think) 15 choose 100 different possibilities. Many thanks!

data.frame structure

CodePudding user response:

We can infer the cohort based on the NA pattern:

library(tidyverse)

answers <- tribble(
  ~candidate, ~q1, ~q2, ~q3,
  1,0,NA,NA,
  2,1,NA,NA,
  3,0,0,00
)
answers
#> # A tibble: 3 x 4
#>   candidate    q1    q2    q3
#>       <dbl> <dbl> <dbl> <dbl>
#> 1         1     0    NA    NA
#> 2         2     1    NA    NA
#> 3         3     0     0     0

# infer cohort by NA pattern
cohorts <-
  answers %>%
  group_by(candidate) %>%
  mutate_at(vars(-group_cols()), ~ ifelse(is.na(.x), NA, TRUE)) %>%
  unite(-candidate, col = "cohort")
cohorts
#> # A tibble: 3 x 2
#> # Groups:   candidate [3]
#>   candidate cohort        
#>       <dbl> <chr>         
#> 1         1 TRUE_NA_NA    
#> 2         2 TRUE_NA_NA    
#> 3         3 TRUE_TRUE_TRUE

answers %>%
  pivot_longer(-candidate) %>%
  left_join(cohorts) %>%
  # count filled answers per candidate and cohort
  group_by(cohort, candidate) %>%
  filter(! is.na(value)) %>%
  count() %>%
  # get the largest cohort
  arrange(-n) %>%
  pull(cohort) %>%
  first()
#> Joining, by = "candidate"
#> [1] "TRUE_TRUE_TRUE"

Created on 2021-09-21 by the reprex package (v2.0.1)

  • Related