Home > OS >  R dplyr group_by two different columns and summarise
R dplyr group_by two different columns and summarise

Time:02-21

Consider this data frame.

question <- data.frame("Product" = c("P001", "P001", "P001", "P002", "P002", "P002"),
                       "Activity" = c("sawing", "planning", "opening", "sawing", "planning", "opening"),
                       "Employee" = c("Tom", "Bert", "Louisa", "Bert", "Louisa", "Louisa"))
Product Activity Employee
P001 sawing Tom
P001 planning Bert
P001 opening Louisa
P002 sawing Bert
P002 planning Louisa
P002 opening Louisa

I would like to do 3 summarisations in one new table:

  • Group by activity and aggregate the number of activities;
  • Group by activity and summarise the number of employees that can do a specific process;
  • Group by product and summarise the number of employees that do a specific process, averaged per product.

It's the last summarisation that I can't figure out, since I need a different "group_by" for this. What I have so far is:

days <- question %>%
  group_by(Activity) %>%
  summarise("Number of activities" = n(),
            "Number of employees" = length(unique(Employee)))

So for example, as a result for the last summarisation, I would like to get:

Activity Number of activities Number of employees Average number of employees per activity per product
opening 2 1 1
planning 2 2 1
sawing 2 2 1

How do I do the last step/column?

CodePudding user response:

One option is to calculate the number of employees per product separately, and join this to the original dataset before calculating the final summary:

summary1 <- question %>%
  group_by(Activity, Product) %>%
  summarize(n_employees_per_prod_act = n(), .groups = "drop")

question %>%
  left_join(summary1, by = c("Activity", "Product")) %>%
  group_by(Activity) %>%
  summarise("Number of activities" = n(),
            "Number of employees" = length(unique(Employee)),
            "Average number of employees per activity per product" = mean(n_employees_per_prod_act))

# A tibble: 3 x 4
  Activity `Number of activ~ `Number of emplo~ `Average number of~
  <chr>                <int>             <int>               <dbl>
1 opening                  2                 1                   1
2 planning                 2                 2                   1
3 sawing                   2                 2                   1

CodePudding user response:

Here is my attempt. Please notice that I changed Number of activties to Number of products in the expected output.


question %>%
  nest_by(Activity, Product) |>
  transmute(uempl = unique(data$Employee),
            uempp = length(uempl)) |>
  group_by(Activity) |>
    summarize("Number of employees" = length(unique(uempl)),
              "Average number of employees per activity per product" = mean(uempp),
              "Number of products" = n())

  # A tibble: 3 × 4
  Activity `Number of employees` `Average number of employees…` `Number of pro…`
  <chr>                    <int>                          <dbl>            <int>
1 opening                      1                              1                2
2 planning                     2                              1                2
3 sawing                       2                              1                2

  • Related