I have a dataframe with the following format:
ID | Income | Year |
---|---|---|
1 | 100 | 2000 |
2 | 200 | 2000 |
3 | 300 | 2000 |
4 | 500 | 2001 |
5 | 1000 | 2001 |
6 | 1500 | 2001 |
7 | 10000 | 2002 |
8 | 15000 | 2002 |
9 | 20000 | 2002 |
I'd like to add a column called income_cat with three possible levels; "low", "medium" and "high" depending on whether the income is in the lower 33th percentile, the middle 33th percentile or the top 33th percentile of that specific year.
ID | Income | Year | income_cat |
---|---|---|---|
1 | 100 | 2000 | low |
2 | 200 | 2000 | medium |
3 | 300 | 2000 | high |
4 | 500 | 2001 | low |
5 | 1000 | 2001 | medium |
6 | 1500 | 2001 | high |
7 | 10000 | 2002 | low |
8 | 15000 | 2002 | medium |
9 | 20000 | 2002 | high |
I struggle to find the proper way to do this and would be very thankful for any suggestions!
CodePudding user response:
This is your first question, so I will make an exception, but stackoverflow community generally asks for something to work with. In other words, we need at least some code. Otherwise, you just ask other to do work for you. This should work in your case.
df$income_cat=as.factor(ifelse(df$Income<quantile(df$Income,0.33), 'low',
ifelse(df$Income<quantile(df$Income,0.66), 'medium', 'high')))
Please tell me if it does or doesn't work.
CodePudding user response:
library(tidyverse)
df %>%
group_by(Year) %>%
mutate(income_cat = case_when(Income > quantile(Income, 0.66) ~ "High",
Income < quantile(Income, 0.33) ~ "Low",
between(Income,
quantile(Income, 0.33),
quantile(Income, 0.66)) ~ "Medium"))
# A tibble: 9 x 4
# Groups: Year [3]
ID Income Year income_cat
<dbl> <dbl> <dbl> <chr>
1 1 100 2000 Low
2 2 200 2000 Medium
3 3 300 2000 High
4 4 500 2001 Low
5 5 1000 2001 Medium
6 6 1500 2001 High
7 7 10000 2002 Low
8 8 15000 2002 Medium
9 9 20000 2002 High
CodePudding user response:
Here's an answer using data.table
. Starting with toy data:
df <- data.table(id = 1:9, income = 100 100*(1:9), year = rep(2000 1:3, each = 3))
id income year
1: 1 200 2001
2: 2 300 2001
3: 3 400 2001
4: 4 500 2002
5: 5 600 2002
6: 6 700 2002
7: 7 800 2003
8: 8 900 2003
9: 9 1000 2003
While grouping by
year
, we can use data.table::fcase
and the quantile
function included in base R:
library(data.table)
setDT(df)
df[, income_cat := fcase(income < quantile(income, 0.33), "low",
income < quantile(income, 0.66), "mid",
default = "high"),
by = year]
df
id income year income_cat
1: 1 200 2001 low
2: 2 300 2001 mid
3: 3 400 2001 high
4: 4 500 2002 low
5: 5 600 2002 mid
6: 6 700 2002 high
7: 7 800 2003 low
8: 8 900 2003 mid
9: 9 1000 2003 high
CodePudding user response:
library(data.table)
library(magrittr)
df <- data.table(id = 1:9, income = 100 100*(1:9), year = rep(2000 1:3, each = 3))
df[, res := cut(
x = income,
breaks = c(-Inf, quantile(x = income, probs = seq(0, 1, 1 /3))[2:3], Inf),
labels = c("Low", "Medium", "High")), by = year] %>%
.[]
#> id income year res
#> 1: 1 200 2001 Low
#> 2: 2 300 2001 Medium
#> 3: 3 400 2001 High
#> 4: 4 500 2002 Low
#> 5: 5 600 2002 Medium
#> 6: 6 700 2002 High
#> 7: 7 800 2003 Low
#> 8: 8 900 2003 Medium
#> 9: 9 1000 2003 High
Created on 2022-11-30 with reprex v2.0.2
CodePudding user response:
I understand that year
is a grouping factor.
Here is a version with findInterval
.
I have tried to recreate the idea of your dataframe and added some more random data points to each group - see data at the end.
library(dplyr)
df %>%
group_by(year) %>%
mutate(quantile = findInterval(income,
quantile(income, probs=c(0.3, .66)))) |>
mutate(quantile = factor(quantile, labels = c("low", "medium", "high")))
#> # A tibble: 30 × 3
#> # Groups: year [3]
#> income year quantile
#> <int> <dbl> <fct>
#> 1 258 2000 medium
#> 2 278 2000 high
#> 3 113 2000 low
#> 4 294 2000 high
#> 5 269 2000 medium
#> 6 149 2000 low
#> 7 217 2000 medium
#> 8 142 2000 low
#> 9 298 2000 high
#> 10 297 2000 high
#> # … with 20 more rows
Data
set.seed(123)
income <- c(sample(100:300, 10),
sample(500:1500,10),
sample(10000:20000, 10))
year <- c(rep(2000,10), rep(2001,10), rep(2002,10))
df <- data.frame(income, year)