I have a dataset with staff information. I have a column that lists their current age and a column that lists their salary. I want to create an R data frame that has 3 columns: one to show all the unique ages, one to count the number of people who are that age and one to give me the median salary for each particular age. On top of this, I would like to group those who are under 21 and over 65. Ideally it would look like this:
age | number of people | median salary |
---|---|---|
Under 21 | 36 | 26,300 |
22 | 15 | 26,300 |
23 | 30 | 27,020 |
24 | 41 | 26,300 |
etc | ||
Over65 | 47 | 39,100 |
The current dataset has hundreds of columns and thousands of rows but the columns that are of interest are like this:
ageyears | sal22 |
---|---|
46 | 28,250 |
32 | 26,300 |
19 | 27,020 |
24 | 26,300 |
53 | 36,105 |
47 | 39,100 |
47 | 26,200 |
70 | 69,500 |
68 | 75,310 |
I'm a bit lost on the best way to do this but assume some sort of loop would work best? Thanks so much for any direction or help.
CodePudding user response:
library(tidyverse)
sample_data <- tibble(
age = sample(17:70, 100, replace = TRUE) %>% as.character(),
salary = sample(20000:90000, 100, replace = TRUE)
)
# A tibble: 100 × 2
age salary
<chr> <int>
1 56 35130
2 56 44203
3 20 28701
4 47 66564
5 66 60823
6 54 36755
7 66 30731
8 68 21338
9 19 80875
10 61 44547
# … with 90 more rows
# ℹ Use `print(n = ...)` to see more rows
sample_data %>%
mutate(age = case_when(age <= 21 ~ "Under 21",
age >= 65 ~ "Over 65",
TRUE ~ age)) %>%
group_by(age) %>%
summarise(count = n(),
median_salary = median(salary))
# A tibble: 38 × 3
age count median_salary
<chr> <int> <dbl>
1 22 4 46284.
2 23 3 55171
3 25 3 74545
4 27 1 37052
5 28 3 66006
6 29 1 82877
7 30 2 40342.
8 31 2 27815
9 32 1 32282
10 33 3 64523
# … with 28 more rows
# ℹ Use `print(n = ...)` to see more rows