Home > Software engineering >  Analyze ranked-choice data
Analyze ranked-choice data

Time:11-26

I'm currently doing research and collecting a ranked-choice data. Basically people choosing their preferences in a topic. E.g., people ranking their preference on fruits: orange, mango, apple, avocado

The clean data frame looks like this:

            Fruits                          Color
1   orange;apple;banana;avocado     blue;yellow;red;green
2   avocado;apple;banana;orange     red;green;blue;yellow
3   apple;banana;orange;avocado     yellow;red;green;blue
4   banana;orange;apple;avocado     green;blue;red;yellow
5   apple;avocado;banana;orange     yellow;blue;green;red

The first person put orange as their first preference, then apple, banana, and avocado as the last preference. and so on

Scoring: 1st preference = 4; 2nd preference = 3; 3rd preference = 2; 4th preference = 1

Desired result

        apple   avocado banana  orange  blue    green   red yellow
    1   3       1       2       4       4       1       2       3
    2   3       4       2       1       2       3       4       1
    3   4       1       3       2       1       2       3       4
    4   2       1       4       3       3       4       2       1
    5   4       3       2       1       3       2       1       4

The part that I confused is to figure out how to give score for each column -> turn from semicolon separated string into column with numeric value. If I can pass this, I can create the desired output dataframe.

I've found pmr package, but the documentation only a few. Moreover, that package is too advance. I don't really need that for current state, just need simple scores for each preferences

Please help me at the scoring stage

CodePudding user response:

Here's an approach that works using a few lapply() and vapply() calls, but will generalize to more columns.

library(tibble)
d <- tibble::tribble(
  ~Fruits,                          ~Color,
     "orange;apple;banana;avocado",     "blue;yellow;red;green",
     "avocado;apple;banana;orange",     "red;green;blue;yellow",
     "apple;banana;orange;avocado",     "yellow;red;green;blue",
     "banana;orange;apple;avocado",     "green;blue;red;yellow",
     "apple;avocado;banana;orange",     "yellow;blue;green;red"
)


x <- lapply(unname(d), \(col) {
  l <- col %>% strsplit(";")
  x <- l[[1]] %>% unique() %>% sort()
  out <- lapply(x, \(x) {
    vapply(l, FUN.VALUE = numeric(1), \(vec) which(rev(vec) == x))
  })
  names(out) <- x
  as.data.frame(out)
})

do.call(cbind, x)
  apple avocado banana orange blue green red yellow
1     3       1      2      4    4     1   2      3
2     3       4      2      1    2     3   4      1
3     4       1      3      2    1     2   3      4
4     2       1      4      3    3     4   2      1
5     4       3      2      1    3     2   1      4

CodePudding user response:

Here’s a tidyverse solution built on tidyr::separate_rows() to break apart the items and dplyr::n() 1 - dplyr::row_number() to get reversed ranks by person and category.

library(dplyr)
library(tidyr)

prefs %>%
  mutate(id = row_number()) %>%
  separate_rows(!id) %>%
  pivot_longer(
    !id,
    names_to = "category",
    values_to = "item"
  ) %>%
  group_by(id, category) %>%
  mutate(rank = n()   1 - row_number()) %>%
  ungroup() %>%
  arrange(category) %>%  
  pivot_wider(
    id_cols = id,
    names_from = item,
    values_from = rank
  )
# A tibble: 5 × 9
     id  blue yellow   red green orange apple banana avocado
  <int> <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>   <dbl>
1     1     4      3     2     1      4     3      2       1
2     2     2      1     4     3      1     3      2       4
3     3     1      4     3     2      2     4      3       1
4     4     3      1     2     4      3     2      4       1
5     5     3      4     1     2      1     4      2       3
  • Related