In R, how do you make a new row that is the combination of two other rows without removing the origi-CodePudding

I have a longitudinal dataset about individuals from different socioeconomic backgrounds. The raw data is broken up into high, middle, lower middle, and lower SES statuses. However, I want to add a fifth row that aggregates the lower middle and lower SES statuses. I know how to get the information that I need as columns (see below), but I'm not sure how to elegantly get that information into another row.

Here's a subset of my raw dataset:

library(dplyr)

test_data <- tibble(month = c(rep(c("Jan"), 4), rep(c("Feb"), 4)),
                    ses = c(rep(c("High", "Mid", "Mid Low", "Low"), 2)),
                    total = c(10, 20, 20, 30, 9, 11, 40, 60),
                    total_selected = c(9, 10, 8, 3, 8, 6, 8, 6)) %>%
  group_by(month, ses) %>%
                    mutate(success_rate = total_selected/total)

And here's my code that does get the information that I need (i.e., it aggregates the information for lower and lower middle ses), but it puts them as columns instead of rows:

(test_data2 <- test_data %>%
  group_by(month) %>%
  mutate(three_ses_total = case_when(
    ses %in% c("High", "Mid") ~ total,
    ses %in% c("Mid Low", "Low") ~ (total[ses == "Mid Low"]   total[ses == "Low"])
  ),
  three_ses_total_selected = case_when(
    ses %in% c("High", "Mid") ~ total_selected,
    ses %in% c("Mid Low", "Low") ~ (total_selected[ses == "Mid Low"]   total_selected[ses == "Low"])
  ),
  three_ses_success_rate = case_when(
    ses %in% c("High", "Mid") ~ success_rate,
    ses %in% c("Mid Low", "Low") ~ three_ses_total_selected/three_ses_total
  )))

Last, this is what I want the output to look like. Note: I want 5 rows--in other words, I still want the 4 raw classes in the dataset, but I also want the new combined lower middle and lower:

(answer <- tibble(month = c(rep(c("Jan"), 5), rep(c("Feb"), 5)),
                    ses = c(rep(c("High", "Mid", "Mid Low", "Low", "Mid Low and Low"), 2)),
                    total = c(10, 20, 20, 30, 50, 9, 11, 40, 60, 100),
                    total_selected = c(9, 10, 8, 3, 11, 8, 6, 8, 6, 14)) %>%
  group_by(month, ses) %>%
  mutate(success_rate = total_selected/total))

I'm open to any suggestion, but if there's a dplyr, tidyr, or other tidyverse function(s) that could help, I'd especially appreciate that. I was trying to think if tidyr's pivot functions would work, but I can't seem to crack it.

CodePudding user response：

Try this:

library(dplyr)
test_data %>%
  filter(ses %in% c("Low", "Mid Low")) %>%
  group_by(month) %>%
  summarize(
    ses = "Mid Low and Low",
    across(-c(ses, succes_rate), sum),
    succes_rate = total_selected / total
  ) %>%
  bind_rows(test_data) %>%
  arrange(month, ses)
# # A tibble: 10 x 5
#    month ses             total total_selected succes_rate
#    <chr> <chr>           <dbl>          <dbl>       <dbl>
#  1 Feb   High                9              8       0.889
#  2 Feb   Low                60              6       0.1  
#  3 Feb   Mid                11              6       0.545
#  4 Feb   Mid Low            40              8       0.2  
#  5 Feb   Mid Low and Low   100             14       0.14 
#  6 Jan   High               10              9       0.9  
#  7 Jan   Low                30              3       0.1  
#  8 Jan   Mid                20             10       0.5  
#  9 Jan   Mid Low            20              8       0.4  
# 10 Jan   Mid Low and Low    50             11       0.22

The intent of this is to produce the additional rows first (and separately), which in this case produces just two rows:

test_data %>%
  filter(ses %in% c("Low", "Mid Low")) %>%
  group_by(month) %>%
  summarize(
    ses = "Mid Low and Low",
    across(-c(ses, succes_rate), sum),
    succes_rate = total_selected / total
  )
# # A tibble: 2 x 5
#   month ses             total total_selected succes_rate
#   <chr> <chr>           <dbl>          <dbl>       <dbl>
# 1 Feb   Mid Low and Low   100             14        0.14
# 2 Jan   Mid Low and Low    50             11        0.22

One we have those two, add them to the original data with %>% bind_rows(test_data). (I added the arrange since the months would be out of order.)

CodePudding user response：

It's a little hacky, but this solution works.

First, get the values that you want by combining them together as I did above, but instead of making new column names, keep the original names:

library(dplyr)

(test_data2 <- test_data %>%
  group_by(month) %>%
  mutate(total = case_when(
    ses %in% c("High", "Mid") ~ total,
    ses %in% c("Mid Low", "Low") ~ (total[ses == "Mid Low"]   total[ses == "Low"])
  ),
  total_selected = case_when(
    ses %in% c("High", "Mid") ~ total_selected,
    ses %in% c("Mid Low", "Low") ~ (total_selected[ses == "Mid Low"]   total_selected[ses == "Low"])
  ),
  success_rate = case_when(
    ses %in% c("High", "Mid") ~ success_rate,
    ses %in% c("Mid Low", "Low") ~ total_selected/total
  ))

Then, filter down to just one of the duplicated rows and change the name to what you want:

(test_data2 <- test_data2  %>%
  filter(ses == "Low") %>%
  mutate(ses = "Mid Low and Low"))

Last, full_join it with your original data:

(test_data3 <- test_data %>%
    full_join(test_data2))

Still open to more parsimonious options, but this works!

CodePudding user response：

Here is a solution that combines group_modify and adorn_totals from janitor package:

library(janitor)
library(dplyr)

df %>% 
  filter(ses == "Mid Low" | ses == "Low") %>% 
  group_by(month) %>% 
  group_modify(~ .x %>%
              adorn_totals("row")) %>% 
  filter(ses == "Total") %>% 
  mutate(succes_rate = total_selected/total)

  month ses   total total_selected succes_rate
  <chr> <chr> <dbl>          <dbl>       <dbl>
1 Feb   Total   100             14        0.14
2 Jan   Total    50             11        0.22

CodePudding user response：

An option with add_row

library(dplyr)
test_data %>% 
 group_by(month) %>% 
 group_modify(~ add_row(.x, ses = "Mid Low and Low", 
     !!! colSums(.x[.x$ses %in% c("Mid Low", "Low"),
     c("total", "total_selected")]))) %>%
 ungroup %>% 
 mutate(success_rate = coalesce(success_rate, total_selected/total))

-output

# A tibble: 10 × 5
   month ses             total total_selected success_rate
   <chr> <chr>           <dbl>          <dbl>        <dbl>
 1 Feb   High                9              8        0.889
 2 Feb   Mid                11              6        0.545
 3 Feb   Mid Low            40              8        0.2  
 4 Feb   Low                60              6        0.1  
 5 Feb   Mid Low and Low   100             14        0.14 
 6 Jan   High               10              9        0.9  
 7 Jan   Mid                20             10        0.5  
 8 Jan   Mid Low            20              8        0.4  
 9 Jan   Low                30              3        0.1  
10 Jan   Mid Low and Low    50             11        0.22

Or in data.table

library(data.table)
setDT(test_data)[, rbind(.SD, c(list(ses = "Mid Low and Low"), 
   lapply(.SD[ses %in% c("Mid Low", "Low"),
  .(total, total_selected)], sum)), fill = TRUE), month][, 
  success_rate := fcoalesce(success_rate, total_selected/total)][]
 month             ses total total_selected success_rate
    <char>          <char> <num>          <num>        <num>
 1:    Jan            High    10              9    0.9000000
 2:    Jan             Mid    20             10    0.5000000
 3:    Jan         Mid Low    20              8    0.4000000
 4:    Jan             Low    30              3    0.1000000
 5:    Jan Mid Low and Low    50             11    0.2200000
 6:    Feb            High     9              8    0.8888889
 7:    Feb             Mid    11              6    0.5454545
 8:    Feb         Mid Low    40              8    0.2000000
 9:    Feb             Low    60              6    0.1000000
10:    Feb Mid Low and Low   100             14    0.1400000