My original data is structured as such:
Article Channel1_qty Channel2_qty Channel3_qty
110 30 10 0
110 40 0 10
111 50 5 2
111 60 3 18
I'm ultimately trying to produce a df that shows sums of articles of clothing sold for each channel_qty as well as counts of the number of articles. Using the above example, it would look something like:
Article_count | channel | Sum (total article qty for channel)
2 1 180
2 2 18
2 3 30
I attempted to structure it this way with the following code, but it didn't work:
df %>%
select(Article,
channel1_qty,
channel2_qty,
channel3_qty) %>%
gather(key = "channel", value = "value", -Article) %>%
group_by(channel)
summarise(
Article_count = n_distinct(Article),
total = sum(value)
)
Tried a few variations of this. Thinking of doing it in separate steps or as a loop, if necessary. I'm thinking there must be an easier / more elegant way in dplyr, though. Thanks!
CodePudding user response:
You are on the right track to tidyr::gather()/tidyr::pivot_longer()
, followed by dplyr::group_by()
and then finally dplyr::summarize()
.
The regex in the names_pattern
argument just strips away everything except the actual channel number from the original column names.
library(tidyverse)
d <- structure(list(Article = c(110L, 110L, 111L, 111L), Channel1_qty = c(30L, 40L, 50L, 60L), Channel2_qty = c(10L, 0L, 5L, 3L), Channel3_qty = c(0L, 10L, 2L, 18L)), class = "data.frame", row.names = c(NA, -4L))
d %>%
pivot_longer(-Article,
names_pattern = "^Channel(.*)_qty",
names_to = "channel",
values_to = "qty") %>%
group_by(channel) %>%
summarize(Article_count = n_distinct(Article),
Sum = sum(qty))
#> # A tibble: 3 × 3
#> channel Article_count Sum
#> <chr> <int> <int>
#> 1 1 2 180
#> 2 2 2 18
#> 3 3 2 30
Created on 2022-08-04 by the reprex package (v2.0.1)