I have a longitudinal dataset about individuals from different socioeconomic backgrounds. The raw data is broken up into high, middle, lower middle, and lower SES statuses. However, I want to add a fifth row that aggregates the lower middle and lower SES statuses. I know how to get the information that I need as columns (see below), but I'm not sure how to elegantly get that information into another row.
Here's a subset of my raw dataset:
library(dplyr)
test_data <- tibble(month = c(rep(c("Jan"), 4), rep(c("Feb"), 4)),
ses = c(rep(c("High", "Mid", "Mid Low", "Low"), 2)),
total = c(10, 20, 20, 30, 9, 11, 40, 60),
total_selected = c(9, 10, 8, 3, 8, 6, 8, 6)) %>%
group_by(month, ses) %>%
mutate(success_rate = total_selected/total)
And here's my code that does get the information that I need (i.e., it aggregates the information for lower and lower middle ses), but it puts them as columns instead of rows:
(test_data2 <- test_data %>%
group_by(month) %>%
mutate(three_ses_total = case_when(
ses %in% c("High", "Mid") ~ total,
ses %in% c("Mid Low", "Low") ~ (total[ses == "Mid Low"] total[ses == "Low"])
),
three_ses_total_selected = case_when(
ses %in% c("High", "Mid") ~ total_selected,
ses %in% c("Mid Low", "Low") ~ (total_selected[ses == "Mid Low"] total_selected[ses == "Low"])
),
three_ses_success_rate = case_when(
ses %in% c("High", "Mid") ~ success_rate,
ses %in% c("Mid Low", "Low") ~ three_ses_total_selected/three_ses_total
)))
Last, this is what I want the output to look like. Note: I want 5 rows--in other words, I still want the 4 raw classes in the dataset, but I also want the new combined lower middle and lower:
(answer <- tibble(month = c(rep(c("Jan"), 5), rep(c("Feb"), 5)),
ses = c(rep(c("High", "Mid", "Mid Low", "Low", "Mid Low and Low"), 2)),
total = c(10, 20, 20, 30, 50, 9, 11, 40, 60, 100),
total_selected = c(9, 10, 8, 3, 11, 8, 6, 8, 6, 14)) %>%
group_by(month, ses) %>%
mutate(success_rate = total_selected/total))
I'm open to any suggestion, but if there's a dplyr, tidyr, or other tidyverse function(s) that could help, I'd especially appreciate that. I was trying to think if tidyr's pivot functions would work, but I can't seem to crack it.
CodePudding user response:
Try this:
library(dplyr)
test_data %>%
filter(ses %in% c("Low", "Mid Low")) %>%
group_by(month) %>%
summarize(
ses = "Mid Low and Low",
across(-c(ses, succes_rate), sum),
succes_rate = total_selected / total
) %>%
bind_rows(test_data) %>%
arrange(month, ses)
# # A tibble: 10 x 5
# month ses total total_selected succes_rate
# <chr> <chr> <dbl> <dbl> <dbl>
# 1 Feb High 9 8 0.889
# 2 Feb Low 60 6 0.1
# 3 Feb Mid 11 6 0.545
# 4 Feb Mid Low 40 8 0.2
# 5 Feb Mid Low and Low 100 14 0.14
# 6 Jan High 10 9 0.9
# 7 Jan Low 30 3 0.1
# 8 Jan Mid 20 10 0.5
# 9 Jan Mid Low 20 8 0.4
# 10 Jan Mid Low and Low 50 11 0.22
The intent of this is to produce the additional rows first (and separately), which in this case produces just two rows:
test_data %>%
filter(ses %in% c("Low", "Mid Low")) %>%
group_by(month) %>%
summarize(
ses = "Mid Low and Low",
across(-c(ses, succes_rate), sum),
succes_rate = total_selected / total
)
# # A tibble: 2 x 5
# month ses total total_selected succes_rate
# <chr> <chr> <dbl> <dbl> <dbl>
# 1 Feb Mid Low and Low 100 14 0.14
# 2 Jan Mid Low and Low 50 11 0.22
One we have those two, add them to the original data with %>% bind_rows(test_data)
. (I added the arrange
since the months would be out of order.)
CodePudding user response:
It's a little hacky, but this solution works.
First, get the values that you want by combining them together as I did above, but instead of making new column names, keep the original names:
library(dplyr)
(test_data2 <- test_data %>%
group_by(month) %>%
mutate(total = case_when(
ses %in% c("High", "Mid") ~ total,
ses %in% c("Mid Low", "Low") ~ (total[ses == "Mid Low"] total[ses == "Low"])
),
total_selected = case_when(
ses %in% c("High", "Mid") ~ total_selected,
ses %in% c("Mid Low", "Low") ~ (total_selected[ses == "Mid Low"] total_selected[ses == "Low"])
),
success_rate = case_when(
ses %in% c("High", "Mid") ~ success_rate,
ses %in% c("Mid Low", "Low") ~ total_selected/total
))
Then, filter down to just one of the duplicated rows and change the name to what you want:
(test_data2 <- test_data2 %>%
filter(ses == "Low") %>%
mutate(ses = "Mid Low and Low"))
Last, full_join
it with your original data:
(test_data3 <- test_data %>%
full_join(test_data2))
Still open to more parsimonious options, but this works!
CodePudding user response:
Here is a solution that combines group_modify
and adorn_totals
from janitor
package:
library(janitor)
library(dplyr)
df %>%
filter(ses == "Mid Low" | ses == "Low") %>%
group_by(month) %>%
group_modify(~ .x %>%
adorn_totals("row")) %>%
filter(ses == "Total") %>%
mutate(succes_rate = total_selected/total)
month ses total total_selected succes_rate
<chr> <chr> <dbl> <dbl> <dbl>
1 Feb Total 100 14 0.14
2 Jan Total 50 11 0.22
CodePudding user response:
An option with add_row
library(dplyr)
test_data %>%
group_by(month) %>%
group_modify(~ add_row(.x, ses = "Mid Low and Low",
!!! colSums(.x[.x$ses %in% c("Mid Low", "Low"),
c("total", "total_selected")]))) %>%
ungroup %>%
mutate(success_rate = coalesce(success_rate, total_selected/total))
-output
# A tibble: 10 × 5
month ses total total_selected success_rate
<chr> <chr> <dbl> <dbl> <dbl>
1 Feb High 9 8 0.889
2 Feb Mid 11 6 0.545
3 Feb Mid Low 40 8 0.2
4 Feb Low 60 6 0.1
5 Feb Mid Low and Low 100 14 0.14
6 Jan High 10 9 0.9
7 Jan Mid 20 10 0.5
8 Jan Mid Low 20 8 0.4
9 Jan Low 30 3 0.1
10 Jan Mid Low and Low 50 11 0.22
Or in data.table
library(data.table)
setDT(test_data)[, rbind(.SD, c(list(ses = "Mid Low and Low"),
lapply(.SD[ses %in% c("Mid Low", "Low"),
.(total, total_selected)], sum)), fill = TRUE), month][,
success_rate := fcoalesce(success_rate, total_selected/total)][]
month ses total total_selected success_rate
<char> <char> <num> <num> <num>
1: Jan High 10 9 0.9000000
2: Jan Mid 20 10 0.5000000
3: Jan Mid Low 20 8 0.4000000
4: Jan Low 30 3 0.1000000
5: Jan Mid Low and Low 50 11 0.2200000
6: Feb High 9 8 0.8888889
7: Feb Mid 11 6 0.5454545
8: Feb Mid Low 40 8 0.2000000
9: Feb Low 60 6 0.1000000
10: Feb Mid Low and Low 100 14 0.1400000