I have a dataset that looks something like this. Each event has 1-3 tags, that have colors listed in them. All the colors can occur in any of the 3 tag columns. An item can have more than one event.
> data.frame(item = c(1,1,2,3,4,4,4,5), event = seq(1,8),
tag1 = c("red","red","blue","green","red","yellow","black","purple"),
tag2 = c("blue","NA","NA","yellow","orange","NA","purple","red"),
tag3 = c("NA","NA","NA","red","magenta","NA","red","green"))
item event tag1 tag2 tag3
1 1 1 red blue NA
2 1 2 red NA NA
3 2 3 blue NA NA
4 3 4 green yellow red
5 4 5 red orange magenta
6 4 6 yellow NA NA
7 4 7 black purple red
8 5 8 purple red green
What I want to do is transform the data so that I have a row for each item, and then, a column for each possible color, with the value being whether or not that item was ever tagged with it.
For example:
| Item| Red | Blue | Green | etc |
| 1 | 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 |
I don't want to have to make the columns manually, because in my actual code, there are about 800 different "colors."
I am working with tidyverse to try and fix this, but am open to other packages that make it easier.
Runtime does matter, as I have more than 10 million events in the dataset.
CodePudding user response:
library(tidyverse)
df %>%
na_if("NA") %>%
pivot_longer(cols = starts_with("tag")) %>%
drop_na() %>%
count(item, value) %>%
pivot_wider(names_from = value,
values_from = n,
values_fill = 0)
# A tibble: 5 × 9
item blue red green yellow black magenta orange purple
<dbl> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 2 0 0 0 0 0 0
2 2 1 0 0 0 0 0 0 0
3 3 0 1 1 1 0 0 0 0
4 4 0 2 0 1 1 1 1 1
5 5 0 1 1 0 0 0 0 1
Requests:
df %>%
na_if("NA") %>%
pivot_longer(cols = starts_with("tag")) %>%
drop_na() %>%
count(item, value) %>%
pivot_wider(names_from = value,
values_from = n,
values_fill = 0) %>%
mutate(grey = 0,
white = 0)
CodePudding user response:
Using table
from base R
table(df1$item[row(df1[-c(1:2)])], unlist(df1[-(1:2)]))
-output
black blue green magenta orange purple red yellow
1 0 1 0 0 0 0 2 0
2 0 1 0 0 0 0 0 0
3 0 0 1 0 0 0 1 1
4 1 0 0 1 1 1 2 1
5 0 0 1 0 0 1 1 0
If we want missing combinations, create a factor
column with levels
specified
colors <- factor(unlist(df1[-(1:2)]), levels = c("black", "blue",
"gray", "green", "magenta", "orange", "purple", "red", "yellow", "white"))
items <- df1$item[row(df1[-c(1:2)])]
table(items, colors)
-output
colors
items black blue gray green magenta orange purple red yellow white
1 0 1 0 0 0 0 0 2 0 0
2 0 1 0 0 0 0 0 0 0 0
3 0 0 0 1 0 0 0 1 1 0
4 1 0 0 0 1 1 1 2 1 0
5 0 0 0 1 0 0 1 1 0 0