How do I group by two variables in R and arrange them in the right order using ggplot?

Time:07-13

I'm having trouble grouping by two variables while keeping the bars in descending order.

I'm using a modified iris data frame for this. This is my script:

library(dplyr)
library(ggplot2)

iris_new <- iris %>% 
  mutate(number_petals = sample(1:10, size = 150, replace = TRUE))

iris_new %>% 
  group_by(number_petals, Species) %>%
  summarise(n = sum(Petal.Length, na.rm = TRUE)) %>%
  arrange(desc(n), by_group = TRUE) %>%
  head(25) %>%
  ggplot(aes(x = reorder(number_petals, n), y = n, fill = factor(Species))) +
  xlab("Number of Petals") +
  ylab("Total sum of petal lengths") +
  geom_col() # + coord_flip()


Two issues with this:

  1. The bars are no longer sorted in descending order since I added the second group_by() variable (Species).

  2. head(25) doesn't take the 25 highest sums of petal lengths per number_petals and per Species. It takes the 25 highest sums overall, regardless of number_petals and Species.

I read that summarise() removes the second group_by variable, but I am not sure what to do with that information.

All help is greatly appreciated!

Answer:

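Here is one way, using factor levels to fix the order of the bars.

We create two new columns, n and n1, where n is the sum of Petal.Length for each Species and number_petals combination and n1 is the sum of the n values within each number_petals. We then arrange the data by n1 in descending order and set the factor levels of number_petals in order of first appearance. These factor levels determine the order of the bars.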

library(dplyr)
library(ggplot2)

iris_new %>% 
  group_by(Species, number_petals) %>%
  summarise(n = sum(Petal.Length, na.rm = TRUE), .groups = "drop") %>%
  group_by(number_petals) %>%
  mutate(n1 = sum(n)) %>%
  arrange(desc(n1)) %>%
  ungroup() %>%
  mutate(number_petals = factor(number_petals, unique(number_petals))) %>%
  ggplot(aes(x = number_petals, y = n, fill = Species)) +
  xlab("Number of Petals") +
  ylab("Total sum of petal lengths") +
  geom_col()
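
The factor step above works because ggplot2 draws a discrete axis in the order of the factor levels, and factor() sorts its levels unless told otherwise. Here is a minimal sketch of that difference; the toy vector x is hypothetical and only stands in for number_petals after arrange().

```r
# Hypothetical toy vector standing in for number_petals after arrange()
x <- c(3, 1, 2, 3, 1)

# By default, factor() sorts the distinct values to build its levels
default_levels <- levels(factor(x))

# Passing unique(x) as the levels keeps the order of first appearance
appearance_levels <- levels(factor(x, levels = unique(x)))

print(default_levels)     # "1" "2" "3"
print(appearance_levels)  # "3" "1" "2"
```

This is why the data has to be arranged before the factor() call: the row order at that point becomes the bar order.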


head(25) selects the first 25 rows irrespective of grouping. If you want to select the top rows by group, look into ?slice_max, or ?slice in general.
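
To make the difference concrete, here is a small sketch comparing head() with a grouped slice_max(). The seed, the choice of 3 rows, and the object names (sums, top_overall, top_per_species) are assumptions made for illustration only.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed so the random column is reproducible
iris_new <- iris %>%
  mutate(number_petals = sample(1:10, size = 150, replace = TRUE))

# Sum of petal lengths for each Species / number_petals combination
sums <- iris_new %>%
  group_by(Species, number_petals) %>%
  summarise(n = sum(Petal.Length, na.rm = TRUE), .groups = "drop")

# head(3) keeps the first 3 rows overall, ignoring Species
top_overall <- sums %>%
  arrange(desc(n)) %>%
  head(3)

# slice_max() on grouped data keeps the 3 largest sums within each Species
top_per_species <- sums %>%
  group_by(Species) %>%
  slice_max(n, n = 3, with_ties = FALSE) %>%
  ungroup()

print(top_overall)
print(top_per_species)
```

Setting with_ties = FALSE caps the result at exactly 3 rows per group; with the default (TRUE), tied sums could return more.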


To keep only the top n (5 here) number_petals values by total petal length, here is a different approach using a join.

iris_new %>% 
  group_by(number_petals) %>%
  summarise(n = sum(Petal.Length, na.rm = TRUE), .groups = "drop") %>%
  slice_max(n, n = 5) %>%
  inner_join(iris_new %>% 
               group_by(Species, number_petals) %>%
               summarise(n1 = sum(Petal.Length, na.rm = TRUE), .groups = "drop"), 
             by = 'number_petals') %>%
  arrange(desc(n)) %>%
  mutate(number_petals = factor(number_petals, unique(number_petals))) %>%
  ggplot(aes(x = number_petals, y = n1, fill = Species)) +
  xlab("Number of Petals") +
  ylab("Total sum of petal lengths") +
  geom_col()
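
On the remark in the question about summarise() removing a grouping variable: by default it drops only the last grouping level, so the result is still grouped by the remaining variables, while .groups = "drop" returns an ungrouped data frame. Here is a short sketch that checks this with group_vars(); the seed and the names default_groups and dropped_groups are assumptions for illustration.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed so the random column is reproducible
iris_new <- iris %>%
  mutate(number_petals = sample(1:10, size = 150, replace = TRUE))

# Default behaviour: only the last grouping variable (Species) is dropped
default_groups <- iris_new %>%
  group_by(number_petals, Species) %>%
  summarise(n = sum(Petal.Length, na.rm = TRUE)) %>%
  group_vars()

# With .groups = "drop" the result carries no grouping at all
dropped_groups <- iris_new %>%
  group_by(number_petals, Species) %>%
  summarise(n = sum(Petal.Length, na.rm = TRUE), .groups = "drop") %>%
  group_vars()

print(default_groups)  # "number_petals"
print(dropped_groups)  # character(0)
```

Dropping the groups explicitly, as the code above does, avoids surprises when later steps such as mutate() or slice_max() would otherwise run per group.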

