I have a dataframe which contains many binary categorical variables, and I would like to display a heatmap-like plot for all the observations, only displaying two colors for "yes" and "no" levels. I would then like to sort it so that those observations (ID) with the most "yes" in their row appear on top.
The sample dataset is provided here:
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
var1 = c('yes', 'yes', 'no', 'yes', 'no'),
var2 = c('no', 'yes', 'no', 'yes', 'no'),
var3 = c('yes', 'no', 'no', 'yes', 'yes'))
df1
ID var1 var2 var3
1 1 yes no yes
2 2 yes yes no
3 3 no no no
4 4 yes yes yes
5 5 no no yes
I tried using the heatmap() function, but I could not make it work. Can you please help me with that?
CodePudding user response:
You're on the right track with heatmap(). Turn the "yes"/"no" columns of your data frame into a matrix of 0s and 1s, and disable some of the defaults, such as scaling and the dendrogram-based reordering of rows and columns.
mat1 <- 1*(df1[,-1]=="yes")
> mat1
var1 var2 var3
[1,] 1 0 1
[2,] 1 1 0
[3,] 0 0 0
[4,] 1 1 1
[5,] 0 0 1
# You only need this step if you want the IDs to be shown beside the plot.
# Use the ID column itself, so the labels stay correct even when the IDs are not 1, 2, 3, ...
rownames(mat1) <- df1$ID
> mat1
var1 var2 var3
1 1 0 1
2 1 1 0
3 0 0 0
4 1 1 1
5 0 0 1
# reorder the matrix by rowSums before plotting
heatmap(mat1[order(rowSums(mat1)),], scale = "none", Rowv = NA, Colv = NA)
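It may look odd that an ascending sort puts the rows with the most "yes" on top. That works because heatmap() draws the first matrix row at the bottom of the plot. Here is a minimal sketch for checking the row order before plotting; it rebuilds the sample data so it runs on its own, and yes_counts, row_order and sorted_mat are names introduced only for this illustration:

```r
# Rebuild the sample data from the question
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  var1 = c('yes', 'yes', 'no', 'yes', 'no'),
                  var2 = c('no', 'yes', 'no', 'yes', 'no'),
                  var3 = c('yes', 'no', 'no', 'yes', 'yes'))

# 0/1 matrix of the yes/no columns, labelled with the IDs
mat1 <- 1 * (df1[, -1] == "yes")
rownames(mat1) <- df1$ID

yes_counts <- rowSums(mat1)       # number of "yes" per ID
row_order  <- order(yes_counts)   # ascending: fewest "yes" first
sorted_mat <- mat1[row_order, ]

# IDs listed from the bottom of the plot to the top
rownames(sorted_mat)

heatmap(sorted_mat, scale = "none", Rowv = NA, Colv = NA)
```

For the sample data, ID 3 (no "yes" at all) ends up at the bottom and ID 4 (three "yes") at the top.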
You can change the colour scheme by specifying the col parameter, for example:
heatmap(mat1[order(rowSums(mat1)),], scale = "none", Rowv = NA, Colv = NA, col=c("lightgrey", "tomato"))
If you would prefer the plot to read left-to-right (one column per ID), just transpose the matrix:
heatmap(t(mat1[order(rowSums(mat1)),]), scale = "none", Rowv = NA, Colv = NA)
CodePudding user response:
If you want to use ggplot2, you need to work in long format. I will use the tidyverse here:
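library(tidyverse)  # loads dplyr, tidyr and ggplot2

# Long format: one row per (ID, variable) pair, in columns ID, name, value
df_long <- df1 %>%
  pivot_longer(cols = paste0("var", 1:3))

# IDs sorted by decreasing number of "yes"
order <- df_long %>%
  group_by(ID) %>%
  summarise(n = sum(value == "yes")) %>%
  arrange(-n) %>%
  pull(ID)

df_long %>%
  mutate(ID = factor(ID, levels = order)) %>%
  ggplot(aes(ID, name, fill = value)) +
  geom_tile()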
The part with order builds a vector of your IDs sorted by their number of "yes". You then set the levels of the ID factor to follow this order, so that the heatmap is ordered by the number of "yes".
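If you want the IDs with the most "yes" on top rather than on the left, one option is to put ID on the y axis. This is only a sketch under a few assumptions: ggplot2 draws the first factor level at the bottom of a discrete y axis, so the levels are sorted in ascending order here; id_levels and p are names introduced for this illustration, and the two fill colours are arbitrary choices.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Rebuild the sample data from the question
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  var1 = c('yes', 'yes', 'no', 'yes', 'no'),
                  var2 = c('no', 'yes', 'no', 'yes', 'no'),
                  var3 = c('yes', 'no', 'no', 'yes', 'yes'))

# Long format: one row per (ID, variable) pair, in columns ID, name, value
df_long <- df1 %>%
  pivot_longer(cols = starts_with("var"))

# IDs sorted by ascending number of "yes": the last level is drawn on top
id_levels <- df_long %>%
  group_by(ID) %>%
  summarise(n = sum(value == "yes")) %>%
  arrange(n) %>%
  pull(ID)

p <- df_long %>%
  mutate(ID = factor(ID, levels = id_levels)) %>%
  ggplot(aes(x = name, y = ID, fill = value)) +
  geom_tile(colour = "white") +   # white borders separate the tiles
  scale_fill_manual(values = c(no = "lightgrey", yes = "tomato"))

print(p)
```

With the sample data, ID 4 is drawn in the top row and ID 3 in the bottom row.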