How to Create Count Columns for Many Variables in R-CodePudding

I have a dataset that looks something like this. Each event has 1-3 tags, that have colors listed in them. All the colors can occur in any of the 3 tag columns. An item can have more than one event.

> data.frame(item = c(1,1,2,3,4,4,4,5), event = seq(1,8), 
             tag1 = c("red","red","blue","green","red","yellow","black","purple"), 
             tag2 = c("blue","NA","NA","yellow","orange","NA","purple","red"), 
             tag3 = c("NA","NA","NA","red","magenta","NA","red","green"))
  item event   tag1   tag2    tag3
1    1     1    red   blue      NA
2    1     2    red     NA      NA
3    2     3   blue     NA      NA
4    3     4  green yellow     red
5    4     5    red orange magenta
6    4     6 yellow     NA      NA
7    4     7  black purple     red
8    5     8 purple    red   green

What I want to do is transform the data so that I have a row for each item, and then, a column for each possible color, with the value being whether or not that item was ever tagged with it.

For example:

| Item| Red | Blue | Green | etc |
| 1   | 1   | 1    | 0     | 0   |
| 2   | 0   | 0    | 1     | 0   |

I don't want to have to make the columns manually, because in my actual code, there are about 800 different "colors."

I am working with tidyverse to try and fix this, but am open to other packages that make it easier.

Runtime does matter, as I have more than 10 million events in the dataset.

CodePudding user response：

library(tidyverse)

df %>%
  na_if("NA") %>% 
  pivot_longer(cols = starts_with("tag")) %>% 
  drop_na() %>% 
  count(item, value) %>% 
  pivot_wider(names_from = value, 
              values_from = n, 
              values_fill = 0)

# A tibble: 5 × 9
   item  blue   red green yellow black magenta orange purple
  <dbl> <int> <int> <int>  <int> <int>   <int>  <int>  <int>
1     1     1     2     0      0     0       0      0      0
2     2     1     0     0      0     0       0      0      0
3     3     0     1     1      1     0       0      0      0
4     4     0     2     0      1     1       1      1      1
5     5     0     1     1      0     0       0      0      1

Requests:

df %>%
  na_if("NA") %>% 
  pivot_longer(cols = starts_with("tag")) %>% 
  drop_na() %>% 
  count(item, value) %>% 
  pivot_wider(names_from = value, 
              values_from = n, 
              values_fill = 0) %>% 
  mutate(grey = 0, 
         white = 0)

CodePudding user response：

Using table from base R

table(df1$item[row(df1[-c(1:2)])], unlist(df1[-(1:2)]))

-output

    black blue green magenta orange purple red yellow
  1     0    1     0       0      0      0   2      0
  2     0    1     0       0      0      0   0      0
  3     0    0     1       0      0      0   1      1
  4     1    0     0       1      1      1   2      1
  5     0    0     1       0      0      1   1      0

If we want missing combinations, create a factor column with levels specified

colors <- factor(unlist(df1[-(1:2)]), levels = c("black", "blue", 
   "gray", "green", "magenta", "orange", "purple", "red", "yellow", "white"))
items <- df1$item[row(df1[-c(1:2)])]
table(items, colors)

-output

    colors
items black blue gray green magenta orange purple red yellow white
    1     0    1    0     0       0      0      0   2      0     0
    2     0    1    0     0       0      0      0   0      0     0
    3     0    0    0     1       0      0      0   1      1     0
    4     1    0    0     0       1      1      1   2      1     0
    5     0    0    0     1       0      0      1   1      0     0