I'm using ggplot2 to create figures and calculate summary statistics from a CSV with roughly 5,000 observations. The CSV I'm working with is structured like this:
point | user | HUC | slope | hand | twi | classification | drainagewing |
---|---|---|---|---|---|---|---|
1 | 1 | 10194587 | 21 | 0 | 30 | active channel | small |
2 | 1 | 18594037 | 20 | 0 | 20 | active floodplain | small |
3 | 2 | 18594037 | 23 | 10 | 10 | active floodplain | small |
4 | 2 | 18503863 | 27 | 25 | 7 | inactive floodplain | small |
5 | 2 | 18503863 | 0 | 10 | 8 | definitely not valley bottom | medium |
6 | 6 | 18503863 | 2 | 2 | 13 | definitely not valley bottom | medium |
7 | 4 | 18503863 | 4 | 3 | 18 | active floodplain | medium |
8 | 5 | 18503863 | 10 | 6 | 2 | inactive floodplain | medium |
9 | 5 | 10194587 | 12 | 2 | 10 | active channel | large |
10 | 2 | 10194587 | 6 | 1 | 29 | active channel | large |
I want to create boxplots and calculate the summary stats for slope, twi, and hand values within small, medium, and large drainage wings -- so, essentially, 9 boxplots and 9 sets of summary stats.
For example, I started working on slope values across small, medium, and large drainage wings:
ggplot(data = vbet, mapping = aes(y = slope, x = drainagewing, fill = drainagewing)) +
  geom_boxplot() +
  labs(title = "Distribution of Slope by Drainage Wing Size",
       x = "Drainage Wing Size",
       y = "Slope",
       fill = "Drainage Wing Size")
I know how to get summary stats of the whole CSV (validation.csv), but I just don't know how to break them apart like I've described above.
Bonus question: how do I control the order of the x-axis items? By default, the boxes appear in the order (Large, Medium, Small). How can I customize this order?
CodePudding user response:
I'm unsure what exactly you mean by 9 summary statistics, but the following might help you get to 9 boxplots.
First we read in the data. Next, we transform it so that the current slope, hand and twi columns are collapsed into two columns, name and value.
# Comma-separated so the multi-word classification values stay in one column
txt <- "point,user,HUC,slope,hand,twi,classification,drainagewing
1,1,10194587,21,0,30,active channel,small
2,1,18594037,20,0,20,active floodplain,small
3,2,18594037,23,10,10,active floodplain,small
4,2,18503863,27,25,7,inactive floodplain,small
5,2,18503863,0,10,8,definitely not valley bottom,medium
6,6,18503863,2,2,13,definitely not valley bottom,medium
7,4,18503863,4,3,18,active floodplain,medium
8,5,18503863,10,6,2,inactive floodplain,medium
9,5,10194587,12,2,10,active channel,large
10,2,10194587,6,1,29,active channel,large"
vbet <- read.table(text = txt, sep = ",", header = TRUE)
long <- tidyr::pivot_longer(vbet, c(slope, twi, hand))
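To make the reshaping step concrete, here is a small sketch on a two-row stand-in. `toy` and `toy_long` are hypothetical names, and the values are rows 1 and 9 of the table in the question. It relies on the default output column names of `pivot_longer()`, which are `name` and `value`.

```r
library(tidyr)

# Hypothetical two-row stand-in for vbet (rows 1 and 9 of the question's table)
toy <- data.frame(
  point = c(1, 9),
  drainagewing = c("small", "large"),
  slope = c(21, 12),
  hand = c(0, 2),
  twi = c(30, 10)
)

# Collapse the three measurement columns into name/value pairs
toy_long <- pivot_longer(toy, c(slope, twi, hand))

# Each original row now appears three times, once per measured variable
print(toy_long)
```

The long table keeps point and drainagewing on every row, which is what makes faceting by name possible.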
You can then use it with ggplot and facet on the names of your previous columns. You can control the order of the x-axis by setting the limits in the x-scale.
library(ggplot2)
ggplot(long, aes(drainagewing, value, fill = drainagewing)) +
  geom_boxplot() +
  scale_x_discrete(limits = c("small", "medium", "large")) +
  facet_wrap(~ name)
Created on 2022-03-31 by the reprex package (v2.0.1)
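As an alternative to setting the scale limits, the order can be stored in the data by converting drainagewing to a factor with explicit levels. This is only a sketch: `demo` and `p` are hypothetical names, and the values are the slope column from the question.

```r
library(ggplot2)

# Hypothetical stand-in with the slope values from the question's table
demo <- data.frame(
  drainagewing = rep(c("small", "medium", "large"), times = c(4, 4, 2)),
  value = c(21, 20, 23, 27, 0, 2, 4, 10, 12, 6)
)

# Fix the order once in the data instead of in each plot
demo$drainagewing <- factor(demo$drainagewing,
                            levels = c("small", "medium", "large"))

# Discrete scales follow the factor levels, so no scale_x_discrete() is needed
p <- ggplot(demo, aes(drainagewing, value, fill = drainagewing)) +
  geom_boxplot()
print(p)
```

With a factor, the legend for fill follows the same order as the x-axis, whereas `limits` on the x-scale only affects the axis.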
CodePudding user response:
The other answer only addresses "how to make a plot for each group based on a column?", which is essentially a duplicate of many other threads. Besides the plotting question, OP also asks how to calculate summary statistics for each group. Here I copy the same plotting code and add a solution for calculating the summaries as well.
library(tidyverse)
long <- tidyr::pivot_longer(vbet, c(slope, twi, hand))
ggplot(long, aes(drainagewing, value, fill = drainagewing)) +
  geom_boxplot() +
  scale_x_discrete(limits = c("small", "medium", "large")) +
  facet_wrap(~ name)
# One data frame per variable/drainage wing combination, then summarise each
long %>%
  split(., list(.$name, .$drainagewing)) %>%
  map(summary)
#> $hand.large
#> point user HUC
#> Min. : 9.00 Min. :2.00 Min. :10194587
#> 1st Qu.: 9.25 1st Qu.:2.75 1st Qu.:10194587
#> Median : 9.50 Median :3.50 Median :10194587
#> Mean : 9.50 Mean :3.50 Mean :10194587
#> 3rd Qu.: 9.75 3rd Qu.:4.25 3rd Qu.:10194587
#> Max. :10.00 Max. :5.00 Max. :10194587
#> classification drainagewing name
#> active_channel :2 large :2 Length:2
#> active_floodplain :0 medium:0 Class :character
#> definitely_not_valley_bottom:0 small :0 Mode :character
#> inactive_floodplain :0
#>
#>
#> value
#> Min. :1.00
#> 1st Qu.:1.25
#> Median :1.50
#> Mean :1.50
#> 3rd Qu.:1.75
#> Max. :2.00
#>
#> $slope.large
#> point user HUC
#> Min. : 9.00 Min. :2.00 Min. :10194587
#> 1st Qu.: 9.25 1st Qu.:2.75 1st Qu.:10194587
#> Median : 9.50 Median :3.50 Median :10194587
#> Mean : 9.50 Mean :3.50 Mean :10194587
#> 3rd Qu.: 9.75 3rd Qu.:4.25 3rd Qu.:10194587
#> Max. :10.00 Max. :5.00 Max. :10194587
#> classification drainagewing name
#> active_channel :2 large :2 Length:2
#> active_floodplain :0 medium:0 Class :character
#> definitely_not_valley_bottom:0 small :0 Mode :character
#> inactive_floodplain :0
#>
#>
#> value
#> Min. : 6.0
#> 1st Qu.: 7.5
#> Median : 9.0
#> Mean : 9.0
#> 3rd Qu.:10.5
#> Max. :12.0
#> ------> Continued...
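If one compact table is easier to work with than nine separate `summary()` printouts, a grouped summarise is another option. The following is a sketch, not the method used above: `vbet_demo` and `stats` are hypothetical names, the data are the ten rows from the question, and the choice of statistics is an assumption.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in: the ten rows from the question, needed columns only
vbet_demo <- data.frame(
  drainagewing = rep(c("small", "medium", "large"), times = c(4, 4, 2)),
  slope = c(21, 20, 23, 27, 0, 2, 4, 10, 12, 6),
  hand  = c(0, 0, 10, 25, 10, 2, 3, 6, 2, 1),
  twi   = c(30, 20, 10, 7, 8, 13, 18, 2, 10, 29)
)

# One row per variable/drainage wing combination (3 x 3 = 9 rows)
stats <- vbet_demo %>%
  pivot_longer(c(slope, twi, hand)) %>%
  group_by(name, drainagewing) %>%
  summarise(
    n = n(),
    min = min(value),
    median = median(value),
    mean = mean(value),
    max = max(value),
    .groups = "drop"
  )

print(stats)
```

The result is an ordinary data frame, so it can be filtered, joined, or written out with `write.csv()` like any other table.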