I've got a tibble of data in long format where some observations are missing --- think sales of different articles on different dates, where an article might not have been sold on a given day at all.
I want to complete()
this data so that all combinations of article and date are present in the tibble (this is for creating area plots, since ggplot
's geom_area()
apparently averages neighboring values for the missing data otherwise).
The original tibble contains both an ID number and a textual label for each article, and I would like to also add the same textual labels to the complete()
d tibble (this is so that I'll then be able to glue()
together a more informative label for a subsequent plot); I'm unsure how to do this in a tidy manner, however.
The mapping of ID numbers to labels is available in another tibble.
Here is an artificial example:
library(tidyverse)
articleData = tibble(article = 1:3, description = c("Foo", "Bar", "Baz"), price = c(1, 2, 2))
article <- c(1:3, 1, 3, 1:3)
description <- map_chr(article, ~ articleData[[., "description"]])
date <- as.Date(c(rep("2023-01-01", times = 3), rep("2023-01-02", times = 2), rep("2023-01-03", times = 3)))
sales <- c(21, 58, 35, 32, 47, 25, 47, 42)
price <- map_int(article, ~ articleData[[., "price"]])
data <- tibble(
Date = date,
Article = as_factor(article),
ArticleName = description,
Sales = sales,
Price = price,
Turnover = sales * price
)
data
data %>%
group_by(Date) %>%
summarize(
Date = unique(Date),
TurnoverTotal = sum(Turnover),
SalesTotal = sum(Sales)
) %>%
full_join(data) %>%
mutate(
TurnoverShare = Turnover / TurnoverTotal,
SalesShare = Sales / SalesTotal
) ->
data
data
# This does not work.
data %>% ggplot(mapping = aes(x = Date, y = TurnoverShare, fill = Article))
geom_area()
# Explicitly set missing shares to zero.
data %>% complete(
Date,
Article,
fill = list(
SalesShare = 0,
TurnoverShare = 0
)
) -> data2
data2
# This works.
data2 %>% ggplot(mapping = aes(x = Date, y = TurnoverShare, fill = Article))
geom_area()
In this example, article 2 (the "Bar") was not sold no 2023-01-02, so ArticleName
is NA
in the relevant row in data2
. Is there a way to directly set ArticleName
based on Article
within the call to complete()
, using the articleData
tibble? Alternatively, is there a good (tidy) way of adding this afterward?
Thank you.
CodePudding user response:
Naturally, one possible way of doing this occurred to me five minutes after posting this question; here I can do
data %>% complete(
Date,
Article,
fill = list(
SalesShare = 0,
TurnoverShare = 0
)) %>%
mutate(
ArticleName = map_chr(as.integer(Article), ~ articleData[[., "description"]]),
Price = map_int(as.integer(Article), ~ articleData[[., "price"]])
) -> data2
when creating the complete()
'd tibble --- as.integer()
being the key requirement. I would still welcome better, tidier and/or more elegant solutions, however.
CodePudding user response:
Yes, you can achieve this using complete()
provided that the names and types match in both data frames.
library(dplyr)
library(tidyr)
# Name and types must match in x and y data frames
articleData <- tibble(article = 1:3, description = c("Foo", "Bar", "Baz"), price = c(1, 2, 2)) %>%
rename(Article = article, ArticleName = description, Price = price) %>%
mutate(Article = factor(Article))
data %>%
complete(
Date,
articleData,
fill = list(
SalesShare = 0,
TurnoverShare = 0
)
)
# A tibble: 9 × 10
Date Article ArticleName Price TurnoverTotal SalesTotal Sales Turnover TurnoverShare SalesShare
<date> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2023-01-01 1 Foo 1 207 114 21 21 0.101 0.184
2 2023-01-01 2 Bar 2 207 114 58 116 0.560 0.509
3 2023-01-01 3 Baz 2 207 114 35 70 0.338 0.307
4 2023-01-02 1 Foo 1 126 79 32 32 0.254 0.405
5 2023-01-02 2 Bar 2 NA NA NA NA 0 0
6 2023-01-02 3 Baz 2 126 79 47 94 0.746 0.595
7 2023-01-03 1 Foo 1 203 114 25 25 0.123 0.219
8 2023-01-03 2 Bar 2 203 114 47 94 0.463 0.412
9 2023-01-03 3 Baz 2 203 114 42 84 0.414 0.368