Grouping the levels of a variable based on their mean in R

Time:10-04

I want to group the levels of Neighborhood based on the mean price within each level. Is this the right way to do it?

ames.train.c <- ames.train.c %>%
  group_by(Neighborhood) %>%
  mutate(Neighborhood.Cat = ifelse(mean(price) < 140000, "A",
                            ifelse(mean(price) < 200000, "B",
                            ifelse(mean(price) < 260000, "C",
                            ifelse(mean(price) < 300000, "D",
                            ifelse(mean(price) < 340000, "E", NA))))))

CodePudding user response:

I think cut() is a cleaner fit here: define the break points and labels once, then bin each neighborhood's mean price.

library(dplyr)

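# Break points and the label for each interval between them
cut_breaks <- c(0, 140000, 200000, 260000, 300000, 340000)
cut_labels <- c("A", "B", "C", "D", "E")

ames.train.c <- ames.train.c %>%
  group_by(Neighborhood) %>%
  # right = FALSE makes the intervals [lo, hi), matching the `<` tests in the question
  mutate(Neighborhood.Cat = cut(mean(price), cut_breaks,
                                labels = cut_labels, right = FALSE))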
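
One detail worth checking with cut() is which side of each interval is closed, because that decides where a mean landing exactly on a break point ends up. The following is a minimal base R sketch, reusing the break points and labels from this answer; the three test values in `means` are made up for illustration.

```r
# Break points and labels as used in the answer above.
cut_breaks <- c(0, 140000, 200000, 260000, 300000, 340000)
cut_labels <- c("A", "B", "C", "D", "E")

# Hypothetical group means: just below, exactly on, and just above a break.
means <- c(139999, 140000, 140001)

# Default (right = TRUE): intervals are (lo, hi], so 140000 is still "A".
default_cat <- as.character(cut(means, cut_breaks, labels = cut_labels))

# right = FALSE: intervals are [lo, hi), so 140000 moves to "B",
# which agrees with the `mean(price) < 140000` test in the question.
left_closed_cat <- as.character(
  cut(means, cut_breaks, labels = cut_labels, right = FALSE)
)

print(default_cat)
print(left_closed_cat)
```

With either setting, a mean outside the range of the break points comes back as NA, so it may be worth widening the last break (for example to Inf) if higher means are possible.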

CodePudding user response:

You didn't provide the data, so I had to generate some myself.


library(tidyverse)

df = tibble(
  Neighborhood = rep(1:5, each=1000),
  price = c(rnorm(1000, 100000, 1000),
            rnorm(1000, 150000, 1000),
            rnorm(1000, 90000, 1000),
            rnorm(1000, 200000, 1000),
            rnorm(1000, 300000, 1000))
)

Now we will create a function for assigning categories.

f = function(data) data %>% mutate(
  Neighborhood.Cat = 
    case_when(
      mean(price) < 140000  ~ "A",
      mean(price) < 200000  ~ "B",
      mean(price) < 260000  ~ "C",
      mean(price) < 300000  ~ "D",
      mean(price) < 340000  ~ "E"
  ))

With this function, you can modify groups in the following way:

df = df %>% group_by(Neighborhood) %>% 
  group_modify(~f(.x)) 

Let's check the result:

df %>% group_by(Neighborhood) %>% 
  summarise(mean = mean(price),
            Cat = Neighborhood.Cat[1])

Output:

# A tibble: 5 x 3
  Neighborhood    mean Cat  
         <int>   <dbl> <chr>
1            1 100020. A    
2            2 150011. B    
3            3  89981. A    
4            4 200052. C    
5            5 299998. D  
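
As an aside, group_modify() is probably not required for this task: mutate() on a grouped tibble already evaluates mean(price) once per group. Below is a minimal sketch of that variant, assuming only dplyr is loaded. The `toy` data, the `mean_price` helper column and the `toy_cat` name are hypothetical and not part of the original question.

```r
library(dplyr)

# Hypothetical toy data (made-up prices, not the Ames data set).
toy <- tibble(
  Neighborhood = rep(c("n1", "n2", "n3"), each = 2),
  price = c(100000, 120000, 150000, 170000, 250000, 310000)
)

toy_cat <- toy %>%
  group_by(Neighborhood) %>%
  mutate(
    # One mean per group, recycled to every row of that group.
    mean_price = mean(price),
    Neighborhood.Cat = case_when(
      mean_price < 140000 ~ "A",
      mean_price < 200000 ~ "B",
      mean_price < 260000 ~ "C",
      mean_price < 300000 ~ "D",
      mean_price < 340000 ~ "E",
      TRUE ~ NA_character_  # explicit fallback for means of 340000 or more
    )
  ) %>%
  ungroup()

print(toy_cat)
```

Storing the mean in its own column avoids recomputing it in every branch and makes the assigned category easy to audit. The group means here are 110000, 160000 and 280000, so the three neighborhoods get "A", "B" and "D".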