How to represent data in the cross product of categories in R

Time:11-04

I have a data frame with two categorical variables. Each combination of the two (their cross product) corresponds to exactly one row of the data frame.

The table shown below is a frequency table. Instead of the number of occurrences, I would like each cell to show the value of the column test. Any ideas on how to do this? I have been trying to work it out but have not found a way.

[Image: frequency table of the two categorical variables]

CodePudding user response:

If I understand your problem correctly, you want to pivot the data from long form to wide form. You can do this pretty easily with the pivot_wider function from the tidyr package (part of the tidyverse, alongside dplyr):

First we'll make some sample data (in the future, instead of including your data as an image, use the dput function to output it in a format we can copy and paste to reproduce the problem):

df = data.frame(data_set=c('a','a','a','b','b','b','c','c','c'),
                category=rep.int(c('d','e','f'),3),
                tests=abs(rnorm(9,sd=3)),
                best_threads=rnorm(9, 50, 20))
  data_set category     tests best_threads
1        a        d 3.1162129     15.61119
2        a        e 4.0933109     19.71428
3        a        f 0.1026443     44.63157
4        b        d 3.3482561     47.68211
5        b        e 5.9149545     69.19018
6        b        f 4.2248788     52.54404
7        c        d 3.4384232     41.86539
8        c        e 4.2985273     76.49010
9        c        f 0.2164352     44.36635
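
As an aside on the dput suggestion above, here is a minimal sketch of what sharing data that way looks like. The data frame example_df and its values are made up for illustration; they are not the asker's data.

```r
# Small stand-in data frame (hypothetical values, for illustration only)
example_df <- data.frame(data_set = c("a", "b"),
                         category = c("d", "e"),
                         tests    = c(3.1, 5.9))

# dput() prints R code that recreates the object; paste that text into a question
dput(example_df)

# Round trip: capture the printed code as text, then parse and evaluate it
dput_text <- paste(capture.output(dput(example_df)), collapse = "\n")
rebuilt   <- eval(parse(text = dput_text))

# Check that the rebuilt copy matches the original
all.equal(example_df, rebuilt)
```

Anyone answering can then assign the pasted structure(...) text to a variable and work with the same data.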

Then we just pivot it to wide form, using data_set for the new column names and tests for the values in those cells. We need to drop best_threads, either in the pivot_wider function or beforehand; otherwise it is treated as a row identifier and every original row stays separate:

library(tidyr)                         # pivot_wider() lives in tidyr, not dplyr
pivot_wider(df,                        # Data frame to pivot
            id_cols = -best_threads,   # Drop the best_threads variable
            names_from = 'data_set',   # The variable to use for column names
            values_from = 'tests'      # The variable to put in the cells
    )

# A tibble: 3 × 4
  category     a     b     c
  <chr>    <dbl> <dbl> <dbl>
1 d        3.12   3.35 3.44 
2 e        4.09   5.91 4.30 
3 f        0.103  4.22 0.216
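
The "beforehand" option mentioned earlier can be sketched as follows: drop best_threads with dplyr's select() first, then pivot. This is one possible way to write it, not the only one. The seed and the name wide are assumptions added so the sketch is repeatable and self-contained, so the values will differ from the output above.

```r
library(dplyr)   # select() and the %>% pipe
library(tidyr)   # pivot_wider()

set.seed(1)  # arbitrary seed (an assumption) so the random sample is repeatable
df <- data.frame(data_set = rep(c("a", "b", "c"), each = 3),
                 category = rep(c("d", "e", "f"), times = 3),
                 tests = abs(rnorm(9, sd = 3)),
                 best_threads = rnorm(9, 50, 20))

wide <- df %>%
  select(-best_threads) %>%               # drop the unwanted column first
  pivot_wider(names_from = data_set,      # one new column per data_set value
              values_from = tests)        # fill the cells with tests

wide
```

With the extra column removed up front, category is the only remaining identifier, so the result has one row per category and one column per data_set.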