I want to create a histogram with data from two different conditions (A and B in the example below). I want to plot both distributions in the same plot using geom_histogram in R.
However, it seems that for condition A, the distribution of the whole data set is shown (instead of only A).
In the example below, three cases are shown:
1) Plotting A and B
2) Plotting only A
3) Plotting only B
You will see that the distribution of A is not the same when you compare 1) and 2).
Can anyone explain why this occurs and how to fix this problem?
library(ggplot2)
set.seed(5)
# Create test data frame
test <- data.frame(
  condition = factor(rep(c("A", "B"), each = 200)),
  value = c(rnorm(200, mean = 12, sd = 2.5), rnorm(200, mean = 13, sd = 2.1))
)
# Create separate data sets
test_a <- test[test$condition == "A", ]
test_b <- test[test$condition == "B", ]
# 1) Plot A and B
ggplot(test, aes(x = value, fill = condition)) +
  geom_histogram(binwidth = 0.25, alpha = .5) +
  ggtitle("Test A and AB")
# 2) Plot only A
ggplot(test_a, aes(x = value, fill = condition)) +
  geom_histogram(binwidth = 0.25, alpha = .5) +
  ggtitle("Test A")
# 3) Plot only B
ggplot(test_b, aes(x = value, fill = condition)) +
  geom_histogram(binwidth = 0.25, alpha = .5) +
  ggtitle("Test B")
CodePudding user response:
An alternative for visualization, not to supplant MichaelDewar's answer:
# 1) Plot A and B, overlaid instead of stacked
ggab <- ggplot(test, aes(x = value, fill = condition)) +
  geom_histogram(binwidth = 0.25, alpha = .5, position = "identity") +
  ggtitle("Test A and AB") +
  xlim(5, 20) +
  ylim(0, 13)
# 2) Plot only A
gga <- ggplot(test_a, aes(x = value, fill = condition)) +
  geom_histogram(binwidth = 0.25, alpha = .5) +
  ggtitle("Test A") +
  xlim(5, 20) +
  ylim(0, 13)
# 3) Plot only B
ggb <- ggplot(test_b, aes(x = value, fill = condition)) +
  geom_histogram(binwidth = 0.25, alpha = .5) +
  ggtitle("Test B") +
  xlim(5, 20) +
  ylim(0, 13)
library(patchwork) # solely for a quick side-by-side-by-side presentation
gga + ggab + ggb + plot_annotation(title = 'position = "identity"')
The key in this visualization is adding position = "identity" to the first histogram (the others do not need it).
Alternatively, one could use position = "dodge" (this is best viewed on the console; it is a bit difficult to read in this small snapshot).
And for perspective, position = "stack", the default, shows "A" with a demonstrably altered histogram.
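To check what each position setting actually draws, one can inspect the bar coordinates that ggplot2 computes. This is a minimal sketch, assuming the test data from the question and ggplot2's layer_data() helper; the names p_stack, p_identity, stacked and overlaid are only illustrative.

```r
library(ggplot2)

set.seed(5)
test <- data.frame(
  condition = factor(rep(c("A", "B"), each = 200)),
  value = c(rnorm(200, mean = 12, sd = 2.5), rnorm(200, mean = 13, sd = 2.1))
)

# Default position ("stack"): bars of one condition sit on top of the other
p_stack <- ggplot(test, aes(x = value, fill = condition)) +
  geom_histogram(binwidth = 0.25, alpha = .5)

# position = "identity": every bar is drawn from the baseline
p_identity <- ggplot(test, aes(x = value, fill = condition)) +
  geom_histogram(binwidth = 0.25, alpha = .5, position = "identity")

# Extract the computed bar data (one row per bin and condition)
stacked  <- layer_data(p_stack)
overlaid <- layer_data(p_identity)

# Overlaid bars all start at zero; some stacked bars start above zero
range(overlaid$ymin)
any(stacked$ymin > 0)
```

The count column is the same in both data frames; only ymin and ymax move, which is why the stacked A histogram looks different even though the underlying counts are unchanged.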
CodePudding user response:
The bars are stacked in the combined A and B plot, so the A bars start at the top of the B bars. The axis scales also differ between the plots, and the bins may have different endpoints.
So yes, the combined plot shows the total distribution. The fill shows how much A and B each contribute to it.
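The stacking arithmetic can be reproduced without ggplot2. This is a rough sketch in base R, assuming the test data frame from the question; the break points are chosen for illustration and need not match the bins ggplot2 picks.

```r
set.seed(5)
test <- data.frame(
  condition = factor(rep(c("A", "B"), each = 200)),
  value = c(rnorm(200, mean = 12, sd = 2.5), rnorm(200, mean = 13, sd = 2.1))
)

# Illustrative bin edges of width 0.25 covering the whole data range
breaks <- seq(floor(min(test$value)), ceiling(max(test$value)), by = 0.25)
bins <- cut(test$value, breaks = breaks, include.lowest = TRUE)

# Count per bin and condition (rows: bins, columns: A and B)
counts <- table(bins, test$condition)

# In a stacked plot the A bar starts where the B bar ends
a_bottom <- counts[, "B"]
a_top <- counts[, "A"] + counts[, "B"]

# The top edge of the stacked A bars is the distribution of all the data
head(cbind(A = counts[, "A"], B = counts[, "B"], a_bottom, a_top))
```

The height of each A bar (a_top minus a_bottom) is still the A count; it is the outline of the stacked bars that follows the combined distribution.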
If you want to overlay the two plots, use:
ggplot(mapping = aes(x = value, fill = condition)) +
  geom_histogram(data = test_a, binwidth = 0.25, alpha = .5) +
  geom_histogram(data = test_b, binwidth = 0.25, alpha = .5) +
  ggtitle("Test A and AB")