Home > Software design >  How to bin an integer column with group_by() |> tally() and show all bins, even if no values in t
How to bin an integer column with group_by() |> tally() and show all bins, even if no values in t

Time:12-29

I want to bin an integer column and count how many values in each bin. I have a working solution, but it's limited as it wouldn't show bins that have no values in them.

Consider the hp column in mtcars. I want to bin it by intervals of 40, to get the desired output:

# desired output
#  ──────── ──────── ──────── 
# | lower  | upper  | count  |
#  ──────── ──────── ──────── 
# | 40     | 80     | 5      |
# | 80     | 120    | 10     |
# | 120    | 160    | 4      |
# | 160    | 200    | 6      |
# | 200    | 240    | 3      |
# | 240    | 280    | 3      |
# | 280    | 320    | 0      |
# | 320    | 360    | 1      |
#  ──────── ──────── ──────── 

So far, I have the current code:

library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)

mtcars |> 
  group_by(hp_interval = ggplot2::cut_interval(hp , length = 40, dig.lab = 10)) |> 
  tally() |> 
  separate(hp_interval, into = c("lower", "upper"), sep = ",") |> 
  mutate(across(c(lower, upper), ~str_extract(.x, "\\d ") |> as.numeric()))

#> # A tibble: 7 x 3
#>   lower upper     n
#>   <dbl> <dbl> <int>
#> 1    40    80     5
#> 2    80   120    10
#> 3   120   160     4
#> 4   160   200     6
#> 5   200   240     3
#> 6   240   280     3  \ ❌ there's one bin missing here! ❌
#> 7   320   360     1  / the bin of 280-320 has 0 values and was implicitly removed

Since ggplot2::cut_interval() is just a wrapper around cut(), I can pass arguments to base::cut.default via .... My question is whether there's a simple trick to force showing bins with 0 values.

CodePudding user response:

You miss .drop = FALSE in group_by(). It determines whether the groups formed by factor levels that don't appear in the data should be dropped or not.

mtcars |> 
  group_by(hp_interval = ggplot2::cut_interval(hp , length = 40, dig.lab = 10), .drop = FALSE) |> 
  tally() |>
  extract(hp_interval, into = c("lower", "upper"), regex = "(\\d ),(\\d )", convert = TRUE)

# A tibble: 8 × 3
  lower upper     n
  <int> <int> <int>
1    40    80     5
2    80   120    10
3   120   160     4
4   160   200     6
5   200   240     3
6   240   280     3
7   280   320     0
8   320   360     1

CodePudding user response:

One way is to use complete and complete the sequence, i.e (continuing your work)

library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)

mtcars |> 
  group_by(hp_interval = ggplot2::cut_interval(hp , length = 40, dig.lab = 10)) |> 
  tally() |> 
  separate(hp_interval, into = c("lower", "upper"), sep = ",") |> 
  mutate(across(c(lower, upper), ~str_extract(.x, "\\d ") |> as.numeric())) |>
  complete(nesting(lower = seq(min(lower), max(lower), by = 40), 
                   upper = seq(min(upper), max(upper), by = 40))) |>
  mutate(n = replace_na(n, 0))

# A tibble: 8 × 3
  lower upper     n
  <dbl> <dbl> <int>
1    40    80     5
2    80   120    10
3   120   160     4
4   160   200     6
5   200   240     3
6   240   280     3
7   280   320     0
8   320   360     1
  • Related