I want to bin an integer column and count how many values in each bin. I have a working solution, but it's limited as it wouldn't show bins that have no values in them.
Consider the hp
column in mtcars
. I want to bin it by intervals of 40
, to get the desired output:
# desired output
# ──────── ──────── ────────
# | lower | upper | count |
# ──────── ──────── ────────
# | 40 | 80 | 5 |
# | 80 | 120 | 10 |
# | 120 | 160 | 4 |
# | 160 | 200 | 6 |
# | 200 | 240 | 3 |
# | 240 | 280 | 3 |
# | 280 | 320 | 0 |
# | 320 | 360 | 1 |
# ──────── ──────── ────────
So far, I have the current code:
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
mtcars |>
group_by(hp_interval = ggplot2::cut_interval(hp , length = 40, dig.lab = 10)) |>
tally() |>
separate(hp_interval, into = c("lower", "upper"), sep = ",") |>
mutate(across(c(lower, upper), ~str_extract(.x, "\\d ") |> as.numeric()))
#> # A tibble: 7 x 3
#> lower upper n
#> <dbl> <dbl> <int>
#> 1 40 80 5
#> 2 80 120 10
#> 3 120 160 4
#> 4 160 200 6
#> 5 200 240 3
#> 6 240 280 3 \ ❌ there's one bin missing here! ❌
#> 7 320 360 1 / the bin of 280-320 has 0 values and was implicitly removed
Since ggplot2::cut_interval()
is just a wrapper around cut()
, I can pass arguments to base::cut.default
via ...
. My question is whether there's a simple trick to force showing bins with 0
values.
CodePudding user response:
You miss .drop = FALSE
in group_by()
. It determines whether the groups formed by factor levels that don't appear in the data should be dropped or not.
mtcars |>
group_by(hp_interval = ggplot2::cut_interval(hp , length = 40, dig.lab = 10), .drop = FALSE) |>
tally() |>
extract(hp_interval, into = c("lower", "upper"), regex = "(\\d ),(\\d )", convert = TRUE)
# A tibble: 8 × 3
lower upper n
<int> <int> <int>
1 40 80 5
2 80 120 10
3 120 160 4
4 160 200 6
5 200 240 3
6 240 280 3
7 280 320 0
8 320 360 1
CodePudding user response:
One way is to use complete
and complete the sequence, i.e (continuing your work)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
mtcars |>
group_by(hp_interval = ggplot2::cut_interval(hp , length = 40, dig.lab = 10)) |>
tally() |>
separate(hp_interval, into = c("lower", "upper"), sep = ",") |>
mutate(across(c(lower, upper), ~str_extract(.x, "\\d ") |> as.numeric())) |>
complete(nesting(lower = seq(min(lower), max(lower), by = 40),
upper = seq(min(upper), max(upper), by = 40))) |>
mutate(n = replace_na(n, 0))
# A tibble: 8 × 3
lower upper n
<dbl> <dbl> <int>
1 40 80 5
2 80 120 10
3 120 160 4
4 160 200 6
5 200 240 3
6 240 280 3
7 280 320 0
8 320 360 1