Home > Software engineering >  Calculating multiple columns from one column with summarise
Calculating multiple columns from one column with summarise

Time:08-17

Here is an example of what I'm trying to achieve:

df <- data.frame(label = c(rep("ABC", 5), rep("CDE", 5), rep("FGH", 5)), x = runif(15, 0, 100))

df %>% group_by(label) %>%
  summarise(across(everything(), list(lessthan_10 = ~sum(. < 10), lessthan_20 = ~sum(. < 20), lessthan_30 = ~sum(. < 20), lessthan_40 = ~sum(. < 40))))

In this case, I'm calculating 4 different columns in the summary (counting the entries less than 10, less than 20, less than 30, and less than 40). In reality, I would like to calculate 100 different columns using a custom function that takes in x and 100 different parameters. Is there a way to do this using a loop or a list without writing out every single column I want to calculate?

CodePudding user response:

You can use purrr::map_dfc:

library(tidyverse)
df %>% 
  group_by(label) %>% 
  summarise(map_dfc(seq(10, 40, 10), ~ tibble("x_lessthan_{.x}" := sum(x < .x))))
  label x_lessthan_10 x_lessthan_20 x_lessthan_30 x_lessthan_40
  <chr>         <int>         <int>         <int>         <int>
1 ABC               0             3             3             3
2 CDE               0             2             3             4
3 FGH               1             2             3             3

CodePudding user response:

Using the input in the Note at the end, for each label create 4 rows that each contains a list with the x values and and one of 10, 20, 30, 40. Then for each such row create the corresponding sum and sum name and finally convert that to wide form.

Change the paste0 to whatever formatting you want, e.g. sprintf("Less_than_%d", s) will force it not to use e notation.

library(dplyr)
library(tidyr)

df %>%
  group_by(label) %>%
  summarize(x = list(x), s = seq(10, 40, 10)) %>%
  rowwise %>%
  mutate(value = sum(x < s), name = paste0("Less_than_", s)) %>%
  pivot_wider(label)

giving:

# A tibble: 3 × 5
  label Less_than_10 Less_than_20 Less_than_30 Less_than_40
  <chr>        <int>        <int>        <int>        <int>
1 ABC              0            0            1            1
2 CDE              1            1            1            1
3 FGH              0            1            1            1

Note

To make the definition of df reproducible we add a set.seed .

set.seed(123)
df <- data.frame(label = c(rep("ABC", 5), rep("CDE", 5), rep("FGH", 5)), 
                 x = runif(15, 0, 100))

giving:

> df
   label        x
1    ABC 28.75775
2    ABC 78.83051
3    ABC 40.89769
4    ABC 88.30174
5    ABC 94.04673
6    CDE  4.55565
7    CDE 52.81055
8    CDE 89.24190
9    CDE 55.14350
10   CDE 45.66147
11   FGH 95.68333
12   FGH 45.33342
13   FGH 67.75706
14   FGH 57.26334
15   FGH 10.29247
  • Related