Summary Statistics for Multiple Boxplots

Time:04-01

I'm using ggplot2 to create figures and calculate summary statistics from a CSV with roughly 5,000 observations. The CSV I'm working with is structured like this:

point user HUC slope hand twi classification drainagewing
1 1 10194587 21 0 30 active channel small
2 1 18594037 20 0 20 active floodplain small
3 2 18594037 23 10 10 active floodplain small
4 2 18503863 27 25 7 inactive floodplain small
5 2 18503863 0 10 8 definitely not valley bottom medium
6 6 18503863 2 2 13 definitely not valley bottom medium
7 4 18503863 4 3 18 active floodplain medium
8 5 18503863 10 6 2 inactive floodplain medium
9 5 10194587 12 2 10 active channel large
10 2 10194587 6 1 29 active channel large

I want to create boxplots and calculate the summary stats for slope, twi, and hand values within small, medium, and large drainage wings -- so, essentially, 9 boxplots and 9 sets of summary stats.

For example, I started working on slope values across small, medium, and large drainage wings:

ggplot(data = vbet, mapping = aes(y = slope, x = drainagewing, fill = drainagewing)) +
  geom_boxplot() +
  labs(title = "Distribution of Slope by Drainage Wing Size",
       x = "Drainage Wing Size",
       y = "Slope",
       fill = "Drainage Wing Size")

image of boxplots showing slope values across drainage wing sizes

I know how to get summary stats of the whole CSV (validation.csv), but I just don't know how to break them apart like I've described above.

Bonus question-- how do I organize x-axis items? For example, the default is to organize my boxes within the plot as (Large, Medium, Small) drainage wings. How can I customize the order here?

CodePudding user response:

I'm unsure what exactly you mean with 9 summary statistics, but the following might help you get to 9 boxplots.

First we read in the data. Next, we reshape it to long format so that the current slope, hand and twi columns are stacked into two columns: name (the original column name) and value (the measurement).

txt <- "point user HUC slope hand twi classification drainagewing
1 1 10194587 21 0 30 active_channel small
2 1 18594037 20 0 20 active_floodplain small
3 2 18594037 23 10 10 active_floodplain small
4 2 18503863 27 25 7 inactive_floodplain small
5 2 18503863 0 10 8 definitely_not_valley_bottom medium
6 6 18503863 2 2 13 definitely_not_valley_bottom medium
7 4 18503863 4 3 18 active_floodplain medium
8 5 18503863 10 6 2 inactive_floodplain medium
9 5 10194587 12 2 10 active_channel large
10 2 10194587 6 1 29 active_channel large"

# Multi-word classification labels are joined with underscores so that the
# default whitespace separator splits every row into exactly eight fields.
vbet <- read.table(text = txt, header = TRUE)

long <- tidyr::pivot_longer(vbet, c(slope, twi, hand))
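To make the reshaping step concrete, here is a minimal sketch on four of the sample rows (points 1, 3, 6 and 9). The names vbet_demo and long_demo are made up for this illustration. Each input row turns into three output rows, one per measurement.

```r
library(tidyr)

# Four rows taken from the sample data in the question (points 1, 3, 6, 9).
# vbet_demo and long_demo are illustrative names, not part of the original code.
vbet_demo <- data.frame(
  point = c(1, 3, 6, 9),
  slope = c(21, 23, 2, 12),
  hand = c(0, 10, 2, 2),
  twi = c(30, 10, 13, 10),
  drainagewing = c("small", "small", "medium", "large")
)

# slope, twi and hand are stacked: their column names go into `name`,
# their values into `value`; point and drainagewing are repeated per row.
long_demo <- pivot_longer(vbet_demo, c(slope, twi, hand))

dim(long_demo)    # 4 input rows x 3 measurements = 12 rows, 4 columns
names(long_demo)  # point, drainagewing, name, value
```

The name column is what facet_wrap(~ name) splits on in the plot.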

You can then use it with ggplot and facet on the names of your previous columns. You can control the order of the x-axis by setting the limits in the x-scale.

library(ggplot2)

ggplot(long, aes(drainagewing, value, fill = drainagewing)) +
  geom_boxplot() +
  scale_x_discrete(limits = c("small", "medium", "large")) +
  facet_wrap(~ name)

Created on 2022-03-31 by the reprex package (v2.0.1)
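As an alternative to setting limits on the x-scale, you could convert drainagewing to a factor with explicit levels before plotting. This is only a sketch using the ten slope values from the question; demo and p are names made up for the illustration. The factor order also carries over to the legend and to any grouped summaries.

```r
library(ggplot2)

# Slope values and drainage wing sizes copied from the sample rows in the question.
demo <- data.frame(
  drainagewing = rep(c("small", "medium", "large"), times = c(4, 4, 2)),
  slope = c(21, 20, 23, 27, 0, 2, 4, 10, 12, 6)
)

# Without this step, character values are ordered alphabetically
# (large, medium, small). Explicit levels fix the order everywhere.
demo$drainagewing <- factor(demo$drainagewing,
                            levels = c("small", "medium", "large"))

p <- ggplot(demo, aes(drainagewing, slope, fill = drainagewing)) +
  geom_boxplot()

levels(demo$drainagewing)
```

Printing p draws the boxes in small, medium, large order with no scale_x_discrete() call.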

CodePudding user response:

The other answer only addresses how to make a plot for each group based on a column, which is essentially a duplicate of many other threads. Besides the plotting question, the OP also asks how to calculate summary statistics for each group. Here I reuse the same plotting code and add a solution for calculating the summaries as well.

library(tidyverse)

long <- tidyr::pivot_longer(vbet, c(slope, twi, hand))

ggplot(long, aes(drainagewing, value, fill = drainagewing)) +
  geom_boxplot() +
  scale_x_discrete(limits = c("small", "medium", "large")) +
  facet_wrap(~ name)

long %>% 
  split(., list(.$name, .$drainagewing)) %>% 
  map(summary)

#> $hand.large
#>      point            user           HUC          
#>  Min.   : 9.00   Min.   :2.00   Min.   :10194587  
#>  1st Qu.: 9.25   1st Qu.:2.75   1st Qu.:10194587  
#>  Median : 9.50   Median :3.50   Median :10194587  
#>  Mean   : 9.50   Mean   :3.50   Mean   :10194587  
#>  3rd Qu.: 9.75   3rd Qu.:4.25   3rd Qu.:10194587  
#>  Max.   :10.00   Max.   :5.00   Max.   :10194587  
#>                       classification drainagewing     name          
#>  active_channel              :2      large :2     Length:2          
#>  active_floodplain           :0      medium:0     Class :character  
#>  definitely_not_valley_bottom:0      small :0     Mode  :character  
#>  inactive_floodplain         :0                                     
#>                                                                     
#>                                                                     
#>      value     
#>  Min.   :1.00  
#>  1st Qu.:1.25  
#>  Median :1.50  
#>  Mean   :1.50  
#>  3rd Qu.:1.75  
#>  Max.   :2.00  
#> 
#> $slope.large
#>      point            user           HUC          
#>  Min.   : 9.00   Min.   :2.00   Min.   :10194587  
#>  1st Qu.: 9.25   1st Qu.:2.75   1st Qu.:10194587  
#>  Median : 9.50   Median :3.50   Median :10194587  
#>  Mean   : 9.50   Mean   :3.50   Mean   :10194587  
#>  3rd Qu.: 9.75   3rd Qu.:4.25   3rd Qu.:10194587  
#>  Max.   :10.00   Max.   :5.00   Max.   :10194587  
#>                       classification drainagewing     name          
#>  active_channel              :2      large :2     Length:2          
#>  active_floodplain           :0      medium:0     Class :character  
#>  definitely_not_valley_bottom:0      small :0     Mode  :character  
#>  inactive_floodplain         :0                                     
#>                                                                     
#>                                                                     
#>      value     
#>  Min.   : 6.0  
#>  1st Qu.: 7.5  
#>  Median : 9.0  
#>  Mean   : 9.0  
#>  3rd Qu.:10.5  
#>  Max.   :12.0  
#> ------> Continued...
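
If a single table with one row per variable and drainage wing size is more convenient than a list of summary() printouts, a grouped summarise() is one option. The following is a sketch, not part of the answer above: vbet_demo and stats are made-up names, and the data frame holds only the slope, hand, twi and drainagewing columns of the ten sample rows.

```r
library(dplyr)
library(tidyr)

# The ten sample rows from the question, reduced to the columns needed here.
vbet_demo <- data.frame(
  slope = c(21, 20, 23, 27, 0, 2, 4, 10, 12, 6),
  hand = c(0, 0, 10, 25, 10, 2, 3, 6, 2, 1),
  twi = c(30, 20, 10, 7, 8, 13, 18, 2, 10, 29),
  drainagewing = rep(c("small", "medium", "large"), times = c(4, 4, 2))
)

stats <- vbet_demo %>%
  pivot_longer(c(slope, twi, hand)) %>%
  group_by(name, drainagewing) %>%
  summarise(
    n = n(),
    min = min(value),
    q1 = quantile(value, 0.25, names = FALSE),
    median = median(value),
    mean = mean(value),
    q3 = quantile(value, 0.75, names = FALSE),
    max = max(value),
    .groups = "drop"  # return an ungrouped tibble
  )

stats  # 3 variables x 3 drainage wing sizes = 9 rows
```

The result is a regular tibble, so it can be filtered, joined, or written out with write.csv() like any other data frame.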