Creating categorical income variable-CodePudding

I have a dataframe with the following format:

ID	Income	Year
1	100	2000
2	200	2000
3	300	2000
4	500	2001
5	1000	2001
6	1500	2001
7	10000	2002
8	15000	2002
9	20000	2002

I'd like to add a column called income_cat with three possible levels; "low", "medium" and "high" depending on whether the income is in the lower 33th percentile, the middle 33th percentile or the top 33th percentile of that specific year.

ID	Income	Year	income_cat
1	100	2000	low
2	200	2000	medium
3	300	2000	high
4	500	2001	low
5	1000	2001	medium
6	1500	2001	high
7	10000	2002	low
8	15000	2002	medium
9	20000	2002	high

I struggle to find the proper way to do this and would be very thankful for any suggestions!

CodePudding user response：

This is your first question, so I will make an exception, but stackoverflow community generally asks for something to work with. In other words, we need at least some code. Otherwise, you just ask other to do work for you. This should work in your case.

df$income_cat=as.factor(ifelse(df$Income<quantile(df$Income,0.33), 'low', 
                        ifelse(df$Income<quantile(df$Income,0.66), 'medium', 'high')))

Please tell me if it does or doesn't work.

CodePudding user response：

library(tidyverse) 

df %>%  
  group_by(Year) %>% 
  mutate(income_cat = case_when(Income > quantile(Income, 0.66) ~ "High", 
                                Income < quantile(Income, 0.33) ~ "Low", 
                                between(Income, 
                                        quantile(Income, 0.33),
                                        quantile(Income, 0.66)) ~ "Medium"))

# A tibble: 9 x 4
# Groups:   Year [3]
     ID Income  Year income_cat
  <dbl>  <dbl> <dbl> <chr>     
1     1    100  2000 Low       
2     2    200  2000 Medium    
3     3    300  2000 High      
4     4    500  2001 Low       
5     5   1000  2001 Medium    
6     6   1500  2001 High      
7     7  10000  2002 Low       
8     8  15000  2002 Medium    
9     9  20000  2002 High

CodePudding user response：

Here's an answer using data.table. Starting with toy data:

df <- data.table(id = 1:9, income = 100 100*(1:9), year = rep(2000 1:3, each = 3))

   id income year
1:  1    200 2001
2:  2    300 2001
3:  3    400 2001
4:  4    500 2002
5:  5    600 2002
6:  6    700 2002
7:  7    800 2003
8:  8    900 2003
9:  9   1000 2003

While grouping by year, we can use data.table::fcase and the quantile function included in base R:

library(data.table)
setDT(df)

df[, income_cat := fcase(income < quantile(income, 0.33), "low",
                         income < quantile(income, 0.66), "mid",
                         default = "high"), 
   by = year]

df
   id income year income_cat
1:  1    200 2001        low
2:  2    300 2001        mid
3:  3    400 2001       high
4:  4    500 2002        low
5:  5    600 2002        mid
6:  6    700 2002       high
7:  7    800 2003        low
8:  8    900 2003        mid
9:  9   1000 2003       high

CodePudding user response：

library(data.table)
library(magrittr)

df <- data.table(id = 1:9, income = 100 100*(1:9), year = rep(2000 1:3, each = 3))

df[, res := cut(
  x = income, 
  breaks = c(-Inf, quantile(x = income, probs = seq(0, 1, 1 /3))[2:3],  Inf),
  labels = c("Low", "Medium", "High")), by = year] %>%
  .[]
#>    id income year    res
#> 1:  1    200 2001    Low
#> 2:  2    300 2001 Medium
#> 3:  3    400 2001   High
#> 4:  4    500 2002    Low
#> 5:  5    600 2002 Medium
#> 6:  6    700 2002   High
#> 7:  7    800 2003    Low
#> 8:  8    900 2003 Medium
#> 9:  9   1000 2003   High

^{Created on 2022-11-30 with reprex v2.0.2}

CodePudding user response：

I understand that year is a grouping factor. Here is a version with findInterval. I have tried to recreate the idea of your dataframe and added some more random data points to each group - see data at the end.

library(dplyr)

df %>%
  group_by(year) %>%
  mutate(quantile = findInterval(income,
        quantile(income, probs=c(0.3, .66)))) |> 
  mutate(quantile = factor(quantile, labels = c("low", "medium", "high")))
#> # A tibble: 30 × 3
#> # Groups:   year [3]
#>    income  year quantile
#>     <int> <dbl> <fct>   
#>  1    258  2000 medium  
#>  2    278  2000 high    
#>  3    113  2000 low     
#>  4    294  2000 high    
#>  5    269  2000 medium  
#>  6    149  2000 low     
#>  7    217  2000 medium  
#>  8    142  2000 low     
#>  9    298  2000 high    
#> 10    297  2000 high    
#> # … with 20 more rows

Data

set.seed(123)
income <- c(sample(100:300, 10),
            sample(500:1500,10),
            sample(10000:20000, 10))
year <- c(rep(2000,10), rep(2001,10), rep(2002,10))

df <- data.frame(income, year)