R - multiple Boxplots, multiple subsets, multiple variables jitter

Time:12-18

I need multiple boxplots in one plot, using different subsets of data and different variables. I did the following:

data_VAR <- subset(Data_HV_VAR, VAR == 1)

data_NoVAR <- subset(Data_HV_VAR, VAR == 0)

boxplot(data_NoVAR$TorHeim, data_VAR$TorHeim,
        data_NoVAR$TorGast, data_VAR$TorGast, ylab = "number of goals", 
        names=c("home team", "home team","away team", "away team"),show.names=TRUE,
col=c('powderblue', 'seagreen3','powderblue', 'seagreen3'))

legend("topright", inset = c(0.01, 0.01),
       c("VAR","No VAR" ), fill=c("seagreen3","powderblue" ), box.col = "transparent", bg = "transparent", cex=0.8)

It worked: I could add multiple boxplots to one plot using different subsets and different variables. But I couldn't figure out how to overlay the individual data points on the depicted boxplots using jitter. I tried this:

stripchart(data_NoVAR$TorHeim, data_VAR$TorHeim, data_NoVAR$TorGast, data_VAR$TorGast, 
method = "jitter", 
vertical = TRUE,
pch = 1, add = TRUE, seed = 1, width = .3, col = "BLACK")

When using ggplot I could add the jitter, but couldn't figure out how to plot all 4 Boxplots in one plot. With ggplot I did the following:

ggplot(data = data_NoVAR, aes(x = 1, y = TorGast)) +
  geom_boxplot(fill = "powderblue") +
  scale_x_discrete() +
  labs(y = "number of goals", x = "away team") +
  geom_point(size = 2, alpha = .3,
             position = position_jitter(seed = 1, width = .3))

Any ideas on both options? I would prefer ggplot (better design). But as long as I find a solution, both options are fine. Thank you for your comment :)

CodePudding user response:

With ggplot2 you can group the data into categories that get plotted separately by mapping a grouping variable in the initial call: ggplot(aes(x = , y = , fill = __)).

The fill aesthetic allows a grouping that differs from the x axis (subsets within each x position, so to say).

# Category2 is a placeholder for your grouping column
p <- ggplot(data = data_NoVAR, aes(x = 1, y = TorGast, fill = Category2)) +
  geom_boxplot(aes(fill = Category2)) +
  scale_x_discrete() +
  labs(y = "number of goals", x = "away team") +
  geom_point(size = 2, alpha = .3,
             position = position_jitter(seed = 1, width = .3))

(I don't know the name of your second category, so I wrote it as Category2. Just change it in my code!)
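For the base graphics route from the question, one possible fix is sketched below. It assumes the problem is that stripchart() expects all groups in a single list (or a formula) rather than as separate vectors, and that the jitter amount is set with the jitter argument; as far as I know, seed and width are not stripchart() arguments, so set.seed() is used for reproducibility instead. The simulated Data_HV_VAR is a hypothetical stand-in for the real data.

```r
# Hypothetical stand-in for Data_HV_VAR (the real data is not shown in the question)
set.seed(1)
Data_HV_VAR <- data.frame(
  VAR     = rep(c(0L, 1L), each = 20),
  TorHeim = rpois(40, lambda = 1.6),
  TorGast = rpois(40, lambda = 1.2)
)

data_VAR   <- subset(Data_HV_VAR, VAR == 1)
data_NoVAR <- subset(Data_HV_VAR, VAR == 0)

# Collect the four groups once so boxplot() and stripchart() use the same order
groups <- list(data_NoVAR$TorHeim, data_VAR$TorHeim,
               data_NoVAR$TorGast, data_VAR$TorGast)

boxplot(groups, ylab = "number of goals",
        names = c("home team", "home team", "away team", "away team"),
        col = c("powderblue", "seagreen3", "powderblue", "seagreen3"))

# add = TRUE draws the jittered points on top of the existing boxplots
stripchart(groups, method = "jitter", jitter = 0.3, vertical = TRUE,
           pch = 1, col = "black", add = TRUE)
```

Because both calls receive the same list, the points line up with boxes 1 to 4 without having to set the at argument.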

CodePudding user response:

The general principle here is that when plotting using ggplot2, you should consider whether your dataset is in what is referred to as "tidy data" format.

Now, let's try to recreate your base graphics plot in ggplot2.

Tidying the Data

As referenced, you first should be "tidying" your dataset. Using base graphics, you separate your dataset Data_HV_VAR into separate dataframes, then separate them further by specifying columns of data in those datasets. What would be nice is to just specify one dataset and have ggplot2 decide for us how the data is split up based on the values contained within our columns. We can't quite do that yet with Data_HV_VAR though, since the data is not quite "tidy".

Checking the names of your columns you have:

> head(Data_HV_VAR)

  VAR   TorHeim  TorGast
1   0  97.76619 70.24302
2   0 106.40246 54.83644
3   1 109.95918 46.84503
4   1  92.30629 56.06064
5   0  95.76701 61.09645
6   1  91.40704 51.44779

The column VAR is fine: it encodes one piece of information in one column for each observation. The "problem" columns are TorHeim and TorGast. Each of them mixes two pieces of information: (1) whether the value belongs to the "Home Team" or the "Away Team", and (2) the number of goals. The goal here is to separate those two things into their own columns: (1) one column for the team and (2) one column for the number of goals. You can do that in a variety of ways (see melt(), reshape(), gather()). I'll show you one here using pivot_longer():

library(dplyr)   # %>% and mutate()
library(tidyr)   # pivot_longer()
tidydf <- Data_HV_VAR %>%
  pivot_longer(cols = -"VAR", names_to="Team", values_to="goals")

> head(tidydf)
# A tibble: 6 x 3
    VAR Team    goals
  <int> <chr>   <dbl>
1     0 TorHeim  97.8
2     0 TorGast  70.2
3     0 TorHeim 106. 
4     0 TorGast  54.8
5     1 TorHeim 110. 
6     1 TorGast  46.8

That's better. There are a few things to do before plotting to make life easier, however. First, tidydf$VAR is stored as an integer when it really should be a discrete value (a factor). Second, we want to change the values of tidydf$Team to the actual names "Home Team" and "Away Team" and ensure the order is correct. We can do all this in one statement by using the %>% operator to pipe the commands together:

tidydf <- Data_HV_VAR %>%
  pivot_longer(cols = -"VAR", names_to="Team", values_to="goals") %>%
  
  # convert numeric VAR column to a factor (discrete)
  mutate(VAR=factor(VAR)) %>%
  
  # switch order of levels and also relabel them
  mutate(Team=factor(Team, levels=c("TorHeim", "TorGast"), labels=c("TorHeim"="Home Team", "TorGast"="Away Team")))
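One detail worth checking in the relabeling step: as far as I understand factor(), the labels are matched to the levels by position, so any names given on the labels vector do not drive the mapping. A minimal sketch on a small made-up vector (Team and f are illustrative names, not part of your data):

```r
# Toy vector standing in for the Team column after pivot_longer()
Team <- c("TorHeim", "TorGast", "TorHeim")

# levels fixes the order; labels renames them position by position
f <- factor(Team,
            levels = c("TorHeim", "TorGast"),
            labels = c("Home Team", "Away Team"))

levels(f)        # "Home Team" "Away Team"
as.character(f)  # "Home Team" "Away Team" "Home Team"
```

So the order of labels has to follow the order of levels. Running levels(tidydf$Team) after the pipeline is a quick way to confirm the result.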

Here's the plot code and the resulting plot:

ggplot(tidydf, aes(x=Team, y=goals, fill=VAR)) +
  geom_boxplot(position=position_dodge(width=1), width=0.8) +
  scale_fill_manual(values=c("0"="powderblue", "1"="seagreen3")) +
  theme_bw() +
  theme(
    legend.position=c(0.9, 0.9),
    legend.background = element_rect(color="black")
  )

[Image: the resulting ggplot2 boxplot]

Pretty cool! Here's what's going on:

  • We set up the aesthetics so that the x axis is either home team or away team.
  • The y axis is the number of goals.
  • The fill color of the boxplots is mapped to the VAR column. Since this is a factor, each box gets one of two colors. Setting this also forces ggplot2 to group the dataset by this variable, so setting fill= in conjunction with x= effectively creates our 4 boxplots.
  • The scale_fill_manual() function sets the actual colors of the boxplots. This is not required, but without it you would get the default colors.
  • The theme functions control the look of the plot.
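To also show the individual data points, which is what the question asks for, one option is to add a geom_point() layer with position_jitterdodge(), so the points are jittered and dodged to match the boxes. The following is a minimal sketch under stated assumptions: the simulated Data_HV_VAR is a hypothetical stand-in for your data, and the jitter and dodge widths are values you may want to tune.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

# Hypothetical stand-in for Data_HV_VAR
set.seed(1)
Data_HV_VAR <- data.frame(
  VAR     = rep(c(0L, 1L), each = 20),
  TorHeim = rpois(40, lambda = 1.6),
  TorGast = rpois(40, lambda = 1.2)
)

# Same tidying steps as above
tidydf <- Data_HV_VAR %>%
  pivot_longer(cols = -VAR, names_to = "Team", values_to = "goals") %>%
  mutate(
    VAR  = factor(VAR),
    Team = factor(Team, levels = c("TorHeim", "TorGast"),
                  labels = c("Home Team", "Away Team"))
  )

p <- ggplot(tidydf, aes(x = Team, y = goals, fill = VAR)) +
  # outlier.shape = NA hides boxplot outliers so they are not drawn twice
  geom_boxplot(position = position_dodge(width = 1), width = 0.8,
               outlier.shape = NA) +
  # dodge.width matches the boxplot dodge so points sit over their own box
  geom_point(
    position = position_jitterdodge(jitter.width = 0.3, dodge.width = 1,
                                    seed = 1),
    size = 2, alpha = 0.3
  ) +
  scale_fill_manual(values = c("0" = "powderblue", "1" = "seagreen3")) +
  labs(y = "number of goals") +
  theme_bw()

print(p)
```

The points inherit the grouping from fill = VAR, which is what lets position_jitterdodge() place them over the correct box.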