plot a heatmap for binary categorical variables in R

Time:09-07

I have a dataframe which contains many binary categorical variables, and I would like to display a heatmap-like plot for all the observations, only displaying two colors for "yes" and "no" levels. I would then like to sort it so that those observations (ID) with the most "yes" in their row appear on top.

The sample dataset is provided here:

df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                   var1 = c('yes', 'yes', 'no', 'yes', 'no'),
                   var2 = c('no', 'yes', 'no', 'yes', 'no'),
                   var3 = c('yes', 'no', 'no', 'yes', 'yes'))
df1


  ID var1 var2 var3
1  1  yes   no  yes
2  2  yes  yes   no
3  3   no   no   no
4  4  yes  yes  yes
5  5   no   no  yes

I tried using the heatmap() function but I could not make it work. Can you please help me with that?

CodePudding user response:

You're on the right track with heatmap(). Turn the "yes" / "no" columns of your data frame into a matrix of 0s and 1s, and disable some of the defaults, such as scaling and the automatic reordering of rows and columns.

mat1 <- 1*(df1[,-1]=="yes")

> mat1
     var1 var2 var3
[1,]    1    0    1
[2,]    1    1    0
[3,]    0    0    0
[4,]    1    1    1
[5,]    0    0    1
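
The one-liner above works because comparing a data frame to a string returns a logical matrix, and multiplying by 1 coerces TRUE/FALSE to 1/0. A minimal sketch that spells out the two steps, rebuilding the sample data so it runs on its own (the intermediate name is_yes is only for illustration):

```r
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  var1 = c("yes", "yes", "no", "yes", "no"),
                  var2 = c("no", "yes", "no", "yes", "no"),
                  var3 = c("yes", "no", "no", "yes", "yes"))

# Step 1: element-wise comparison gives a logical matrix (ID column dropped)
is_yes <- df1[, -1] == "yes"

# Step 2: multiplying by 1 coerces TRUE/FALSE to 1/0 and keeps the matrix shape
mat1 <- 1 * is_yes

# Number of "yes" answers per observation, used later for sorting
yes_count <- rowSums(mat1)
yes_count
```

Any value other than "yes" becomes 0 here, so a misspelt level would silently count as "no".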

# You only need this step if you want the IDs to be shown beside the plot

rownames(mat1) <- df1$ID

> mat1
  var1 var2 var3
1    1    0    1
2    1    1    0
3    0    0    0
4    1    1    1
5    0    0    1

# reorder the matrix by rowSums before plotting

heatmap(mat1[order(rowSums(mat1)),], scale = "none", Rowv = NA, Colv = NA)
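
The ascending sort is deliberate: heatmap() draws the first matrix row at the bottom of the plot, so the rows with the most "yes" end up on top. A small sketch to check the resulting order, with the matrix typed in directly so the snippet is self-contained (the name ord is only for illustration):

```r
# Same 0/1 matrix as above, with the IDs as row names
mat1 <- matrix(c(1, 0, 1,
                 1, 1, 0,
                 0, 0, 0,
                 1, 1, 1,
                 0, 0, 1),
               ncol = 3, byrow = TRUE,
               dimnames = list(as.character(1:5), c("var1", "var2", "var3")))

# Ascending order of "yes" counts; ties keep their original relative order
ord <- order(rowSums(mat1))

# IDs as they appear in the plot, read from bottom to top
rownames(mat1)[ord]

# Rowv = NA and Colv = NA keep exactly this order (no clustering)
heatmap(mat1[ord, ], scale = "none", Rowv = NA, Colv = NA)
```

For the sample data the IDs read 3, 5, 1, 2, 4 from bottom to top, so ID 4 (three "yes") is the top row.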


You can change the colour scheme with the col parameter, for example:

heatmap(mat1[order(rowSums(mat1)),], scale = "none", Rowv = NA, Colv = NA, col=c("lightgrey", "tomato"))

If you would prefer the plot to read left-to-right (one column per ID), just transpose the matrix:

heatmap(t(mat1[order(rowSums(mat1)),]), scale = "none", Rowv = NA, Colv = NA)

CodePudding user response:

If you want to use ggplot, you need to work in long format. I will use tidyverse here:


library(tidyverse)  # attaches dplyr, tidyr and ggplot2

df_long <- df1 %>%
  pivot_longer(cols = paste0("var",1:3))

order <- df_long %>%
  group_by(ID)%>%
  summarise(n = sum(value == "yes"))%>%
  arrange(-n)%>%
  pull(ID)

df_long %>%
  mutate(ID = factor(ID,levels = order))%>%
  ggplot(aes(ID,name,fill = value)) +
  geom_tile()


The order vector holds the IDs sorted by their number of "yes" values, from most to fewest. Setting the levels of the ID factor to this vector is what makes ggplot draw the tiles in that order rather than in numeric order.
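
To see what the factor levels do without loading any packages, here is a base R sketch of the same idea. The "yes" counts are typed in from the sample data, and the names n_yes, id_order and id_factor are only for illustration:

```r
ids   <- c(1, 2, 3, 4, 5)
n_yes <- c(2, 2, 0, 3, 1)   # "yes" counts per ID, taken from the sample data

# IDs sorted by decreasing number of "yes" (ties keep their original order)
id_order <- ids[order(-n_yes)]

# The levels, not the values, decide the position on a discrete axis
id_factor <- factor(ids, levels = id_order)

levels(id_factor)       # axis order
as.integer(id_factor)   # axis position of IDs 1 to 5
```

ggplot places the first level at the left of a discrete x axis and at the bottom of a discrete y axis. If you map ID to y instead and want the most "yes" on top, one option is to reverse the levels, for example with rev(id_order).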
