Plotting variability of an outcome variable among many group categories

Time:02-17

I would like to plot a graph that gives a sense of how much an expenditure variable varies within each of many group categories.

There are two challenges: (i) the expenditure variable comes from a very skewed distribution, which made me think the most suitable plot would be a boxplot; (ii) there are many group categories: around 200.

I tried the code below, where val_perday is the variable I would like to plot on the y-axis (very skewed distribution) and pf_cbo is the group category variable (over 200 different groups).

val_phy %>% 
    ggplot(aes(pf_cbo, val_perday)) +
    geom_boxplot()

    Error in rowSums(as.matrix(ok)) : 
      'Calloc' could not allocate memory (93244819 of 16 bytes)

I get the error message above, and I don't know whether it is caused by having too many group categories. Is there a way to work around this, or another type of plot that conveys variability across many group categories?

CodePudding user response:

Here is one approach to manually calculating the summary statistics for a boxplot and then plotting it with ggplot. I tried following the approach described in ?geom_boxplot (Summary statistics section).

The one downside of this approach is that you cannot display outliers, but with millions of rows I'm betting this is uninformative anyway.

I'm assuming you have some data structure with a value and a group.

library(tidyverse)
library(ggplot2)

df <- data.frame(x = rcauchy(1e3, seq(1, 100, length.out = 1000)), 
                 group = rep(LETTERS[1:5], each = 200))

With {dplyr} you can summarise the data as follows:

summary <- df %>%
  group_by(group) %>%
  summarise(
    min = min(x),
    max = max(x),
    q1  = quantile(x, 0.25),
    q2  = quantile(x, 0.50),
    q3  = quantile(x, 0.75),
    iqr = q3 - q1
  )

Next, you can use geom_boxplot() with stat = "identity" to manually input the summary statistics as aesthetics.

ggplot(summary, aes(group, fill = group)) +
  geom_boxplot(
    stat = "identity",
    aes(lower  = q1,
        upper  = q3,
        middle = q2,
        ymin   = pmax(q1 - 1.5 * iqr, min),
        ymax   = pmin(q3 + 1.5 * iqr, max))
  )

Created on 2022-02-16 by the reprex package (v2.0.1)
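Since the summarised boxplot cannot show outliers, one option is to count them per group instead. The following is only a minimal sketch, assuming the usual 1.5 * IQR rule and a small simulated data frame of the same shape as df above; the names outlier_counts, n_out and share_out are made up for this illustration.

```r
library(dplyr)

set.seed(1)
# Hypothetical example data with the same columns as df above
df <- data.frame(x = rcauchy(1e3), group = rep(LETTERS[1:5], each = 200))

outlier_counts <- df %>%
  group_by(group) %>%
  summarise(
    q1    = quantile(x, 0.25),
    q3    = quantile(x, 0.75),
    n     = n(),
    # points beyond 1.5 * IQR from the box (assumed outlier rule)
    n_out = sum(x < q1 - 1.5 * (q3 - q1) | x > q3 + 1.5 * (q3 - q1))
  ) %>%
  mutate(share_out = n_out / n)

outlier_counts
```

The share of outlying points per group could then be reported in a table or mapped to a colour in the plot, which avoids drawing millions of individual points.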

EDIT: A {data.table} approach to summarising the data:

library(data.table)

setDT(df, key = "group")
summary <- df[
  , .(min = min(x), 
      max = max(x),
      q1  = quantile(x, 0.25),
      q2  = quantile(x, 0.50),
      q3  = quantile(x, 0.75)),
  by = "group"
][, iqr := q3 - q1]

CodePudding user response:

Here is a way.
Use boxplot.stats to compute the extreme of the lower whisker, the lower ‘hinge’, the median, the upper ‘hinge’ and the extreme of the upper whisker. Then plot the box plot elements (the box, the median line and the whiskers) one by one.

Note that with 200 groups, the plot is unreadable.

set.seed(2022)
val_phy <- data.frame(
  pf_cbo = sample(sprintf("Group%03d", 1:250), 93e6, replace = TRUE),
  val_perday = runif(93e6, max = 100)
)

library(ggplot2)

agg <- aggregate(val_perday ~ pf_cbo, data = val_phy, \(x) boxplot.stats(x)$stats)
str(agg)
#> 'data.frame':    250 obs. of  2 variables:
#>  $ pf_cbo    : chr  "Group001" "Group002" "Group003" "Group004" ...
#>  $ val_perday: num [1:250, 1:5] 2.05e-04 1.42e-04 8.53e-05 2.61e-04 1.27e-04 ...
agg <- cbind(agg[1], agg[[2]])
#agg$x <- as.integer(sub("\\D+", "", agg$pf_cbo))
agg$x <- as.integer(factor(agg$pf_cbo))
str(agg)
#> 'data.frame':    250 obs. of  7 variables:
#>  $ pf_cbo: chr  "Group001" "Group002" "Group003" "Group004" ...
#>  $ 1     : num  2.05e-04 1.42e-04 8.53e-05 2.61e-04 1.27e-04 ...
#>  $ 2     : num  25 24.8 25 25 25.1 ...
#>  $ 3     : num  49.9 49.9 50.2 50.1 50.1 ...
#>  $ 4     : num  75 75 75 75 75 ...
#>  $ 5     : num  100 100 100 100 100 ...
#>  $ x     : int  1 2 3 4 5 6 7 8 9 10 ...

ggplot(agg, aes(x = x)) +
  geom_rect(aes(xmin = x - 0.1, xmax = x + 0.1, ymin = `2`, ymax = `4`),
            colour = "black", fill = "white") +
  geom_segment(aes(x = x - 0.1, xend = x + 0.1, y = `3`, yend = `3`)) +
  geom_segment(aes(xend = x, y = `2`, yend = `1`),
               arrow = arrow(angle = 90, length = unit(5, "points"))) +
  geom_segment(aes(xend = x, y = `4`, yend = `5`),
               arrow = arrow(angle = 90, length = unit(5, "points")))

Created on 2022-02-16 by the reprex package (v2.0.1)
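On the readability problem with around 200 groups, one possible variation (not part of the answer above) is to sort the groups by their median and draw only the interquartile range and the median on a log scale. This is a rough sketch on a small simulated stand-in for val_phy; the data frame toy, the column names lo, q1, med, q3, hi and the log-normal data are all assumptions made for illustration.

```r
library(ggplot2)

set.seed(2022)
# Hypothetical small stand-in for val_phy: 200 groups, skewed positive values
toy <- data.frame(
  pf_cbo     = sample(sprintf("Group%03d", 1:200), 2e4, replace = TRUE),
  val_perday = rlnorm(2e4, meanlog = 3, sdlog = 1)
)

# One row per group: whisker ends, hinges and median from boxplot.stats()
agg <- aggregate(val_perday ~ pf_cbo, data = toy,
                 function(x) boxplot.stats(x)$stats)
agg <- data.frame(pf_cbo = agg$pf_cbo, agg$val_perday)
names(agg)[-1] <- c("lo", "q1", "med", "q3", "hi")

# Order the groups by median so the overall pattern is visible
agg$pf_cbo <- reorder(agg$pf_cbo, agg$med)

p <- ggplot(agg, aes(y = pf_cbo)) +
  geom_segment(aes(x = q1, xend = q3, yend = pf_cbo)) +  # interquartile range
  geom_point(aes(x = med), size = 0.5) +                 # median
  scale_x_log10() +                                      # skewed outcome
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
p
```

Dropping the group labels sacrifices the ability to identify individual groups, so this only helps if the goal is to compare the spread across groups as a whole.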
