Home > database >  Dplyr - summarising multiple variables
Dplyr - summarising multiple variables

Time:08-06

My original data is structured as such:

Article  Channel1_qty Channel2_qty Channel3_qty

 110        30             10           0
 110        40             0            10
 111        50             5            2
 111        60             3            18

I'm ultimately trying to produce a df that shows sums of articles of clothing sold for each channel_qty as well as counts of the number of articles. Using the above example, it would look something like:

Article_count | channel | Sum (total article qty for channel)
      2            1        180
      2            2        18
      2            3        30

I attempted to structure it this way with the following code, but it didn't work:

df %>%
  select(Article,
         channel1_qty, 
         channel2_qty,
         channel3_qty) %>% 
  gather(key = "channel", value = "value", -Article) %>%
  group_by(channel)
  summarise(
    Article_count = n_distinct(Article),
    total = sum(value)
  )

Tried a few variations of this. Thinking of doing it in separate steps or as a loop, if necessary. I'm thinking there must be an easier / more elegant way in dplyr, though. Thanks!

CodePudding user response:

You are on the right track to tidyr::gather()/tidyr::pivot_longer(), followed by dplyr::group_by() and then finally dplyr::summarize().

The regex in the names_pattern argument just strips away everything except the actual channel number from the original column names.

library(tidyverse)

d <- structure(list(Article = c(110L, 110L, 111L, 111L), Channel1_qty = c(30L, 40L, 50L, 60L), Channel2_qty = c(10L, 0L, 5L, 3L), Channel3_qty = c(0L, 10L, 2L, 18L)), class = "data.frame", row.names = c(NA, -4L))

d %>% 
  pivot_longer(-Article, 
               names_pattern = "^Channel(.*)_qty", 
               names_to = "channel", 
               values_to = "qty") %>% 
  group_by(channel) %>% 
  summarize(Article_count = n_distinct(Article),
            Sum = sum(qty))
#> # A tibble: 3 × 3
#>   channel Article_count   Sum
#>   <chr>           <int> <int>
#> 1 1                   2   180
#> 2 2                   2    18
#> 3 3                   2    30

Created on 2022-08-04 by the reprex package (v2.0.1)

  • Related