I have a collection of categorical data and I'm trying to figure out how best to visualize it. It is a "simple" list (97 categories long) with just a name and an associated value. Here's a sample to work with (in the actual set, the names are much longer):
names <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N",
"O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "AA", "AB", "AC", "AD",
"AE", "AF", "AG", "AH", "AI", "AJ", "AK", "AL", "AM", "AN", "AO", "AP", "AQ", "AR",
"AS", "AT", "AU", "AV", "AW", "AX", "AY", "AZ", "BA", "BB", "BC", "BD", "BE", "BF",
"BG", "BH", "BI", "BJ", "BK", "BL", "BM", "BN", "BO", "BP", "BQ", "BR", "BS", "BT",
"BU", "BV", "BW", "BX", "BY", "BZ", "CA", "CB", "CC", "CD", "CE", "CF", "CG", "CH",
"CI", "CJ", "CK", "CL", "CM", "CN", "CO", "CP", "CQ", "CR", "CS")
cts <- c(620, 343, 165, 121, 107, 106, 104, 88, 83, 59, 57, 56, 49, 45, 44, 37, 37,
37, 37, 35, 31, 31, 29, 27, 24, 23, 23, 22, 21, 21, 20, 20, 17, 17, 16, 16, 15, 15,
15, 14, 14, 13, 13, 12, 12, 12, 11, 11, 10, 10, 10, 9, 9, 8, 8, 7, 6, 5, 5, 5, 5, 5,
4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1)
testdat <- data.frame(names, cts)
My initial thought was to make a lollipop chart, but because there are so many points, R squishes them all together and it becomes a mess. Bar/column charts are out for the same reason. I thought it might be possible to make a plot in which each category has a shape (a box?) with an area that corresponds to the cts variable, but I haven't found anything like that (I tried the waffle package, but I'm struggling with it). Everything that I've found seems to require at least two numerical values to plot in an x vs y way.
My next data set is in the same format, only with many more categories (9,351 instead of "just" 97), so I'm hoping for something I could expand.
Anyone have ideas on how to look at this data without cutting it up?
CodePudding user response:
Maybe a treemap?
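library(ggplot2)
library(treemapify)
# Tile area encodes cts; one fill colour per category
ggplot(testdat, aes(area = cts, fill = names)) +
  geom_treemap() +
  geom_treemap_text(aes(label = names), place = 'center') +
  # Shuffle the colour order so neighbouring tiles are easier to tell apart
  scale_fill_discrete(guide = 'none', limits = sample(testdat$names))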
CodePudding user response:
There are two problems here. One is that a plot with 100 labelled categories is always going to be too busy. It might be better to plot the sorted values on a numeric axis, and label some important illustrative categories:
library(tidyverse)
library(ggrepel)
set.seed(1)
# Sort by count and give each category a position on the x axis
plot_dat <- testdat %>%
  arrange(cts) %>%
  mutate(num = seq(nrow(.)))
ggplot(plot_dat, aes(num, cts)) +
  geom_point(colour = "deepskyblue4") +
  # Label five randomly chosen categories as examples
  geom_label_repel(data = plot_dat[sample(nrow(plot_dat), 5), ],
                   aes(label = names), nudge_y = 0.5) +
  scale_y_log10() +
  theme_minimal() +
  ggtitle("Ordered cts values (log scale)") +
  labs(x = "") +
  theme(text = element_text(size = 16))
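The snippet above labels five random rows. If "important" means "largest", the rows to label could be picked deterministically instead. This is only a sketch in base R: the small `dat` data frame is a stand-in built from the first few sample values, and the top-three rule is an assumption, not something the answer prescribes.

```r
# Stand-in for the question's testdat (first eight sample values only)
dat <- data.frame(names = c("A", "B", "C", "D", "E", "F", "G", "H"),
                  cts   = c(620, 343, 165, 121, 107, 106, 104, 88))

# Sort ascending and add a position column, mirroring plot_dat
dat <- dat[order(dat$cts), ]
dat$num <- seq_len(nrow(dat))

# Assumed selection rule: keep the three categories with the largest counts
to_label <- dat[order(dat$cts, decreasing = TRUE)[1:3], ]
to_label
```

A data frame like `to_label` could then be passed as the `data` argument of `geom_label_repel()` in place of the random sample.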
The other problem is filling 2D space with 1D information. You can do this with a treemap, or with circle packing:
library(packcircles)
library(ggplot2)
library(ggforce)
# Compute a non-overlapping circle layout; adds x, y and radius columns
testdat <- cbind(testdat, circleRepelLayout(testdat$cts)$layout)
ggplot(testdat, aes(x0 = x, y0 = y, fill = radius)) +
  geom_circle(aes(r = radius)) +
  # rank() rather than order(), so that larger circles get larger labels
  geom_text(aes(x, y, label = names, size = rank(radius))) +
  coord_equal() +
  theme_void() +
  scale_fill_distiller(palette = "Pastel1") +
  theme(legend.position = "none")
Or just a waffle where the colours represent values:
library(tidyverse)
# Place each category in a cell of a 10 x 10 grid (enough for up to 100 rows)
testdat %>%
  mutate(x = rep(1:10, each = 10)[seq(nrow(.))],
         y = rep(1:10, 10)[seq(nrow(.))]) %>%
  ggplot(aes(x, y, fill = log(cts))) +
  geom_tile(width = 0.8, height = 0.8) +
  geom_text(aes(label = names), color = "white") +
  scale_fill_viridis_c(option = "E") +
  coord_equal() +
  theme_void()
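The 10 x 10 grid above is hard-coded for at most 100 categories. For the larger data set mentioned in the question (9,351 categories), the grid size could be derived from the row count. A minimal sketch in base R follows; `grid_positions` is a hypothetical helper name, not part of any package, and it fills the grid column by column in the same order as the code above.

```r
# Hypothetical helper: assign n items to cells of the smallest square grid
# that can hold them, filling column by column
grid_positions <- function(n) {
  side <- ceiling(sqrt(n))           # grid is side x side
  idx  <- seq_len(n) - 1             # zero-based cell index
  data.frame(x = idx %/% side + 1,   # column number
             y = idx %% side + 1)    # row within the column
}

pos_small <- grid_positions(97)      # fits in a 10 x 10 grid
pos_large <- grid_positions(9351)    # fits in a 97 x 97 grid
```

Something like `cbind(testdat, grid_positions(nrow(testdat)))` could then stand in for the `mutate()` step. With thousands of tiles the text labels would presumably have to be dropped.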
CodePudding user response:
A simple approach: use a bar chart of the log-transformed counts:
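library(tidyverse)
testdat %>%
  # Fix the factor levels so the bars keep the order of the data
  mutate(names = factor(names, levels = names)) %>%
  # log(1) is 0, so categories with a count of 1 get zero-height bars
  ggplot(aes(x = names, y = log(cts), fill = log(cts))) +
  geom_col() +
  theme_bw()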