Home > Mobile >  R dplyr - Summarize from two distinct columns containing the same value (single step)
R dplyr - Summarize from two distinct columns containing the same value (single step)

Time:10-21

Sample Dataframe:

      ID RNG_BEFORE_WDL RNG_AFTER_WDL
     <int> <chr>          <chr>        
 1 3490115 Less than 10K  Less than 10K
 2 3564671 Less than 1K   0            
 3 3914214 Less than 30K  0            
 4 3971472 More than 60K  More than 60K
 5 3971472 More than 60K  More than 60K
 6 4138130 Less than 1K   0            
 7 4143893 Less than 10K  Less than 10K
 8 4145280 Less than 10K  Less than 100
 9 4146908 Less than 30K  Less than 30K
10 4146929 Less than 10K  Less than 1K 
11 4146958 Less than 30K  Less than 10K
12 4147813 Less than 10K  Less than 1K 
13 4148128 Less than 1K   0            
14 4148446 Less than 60K  Less than 60K
15 4148446 Less than 60K  Less than 60K

2 columns (RNG_BEFORE_WDL & RNG_AFTER_WDL) has different name but share unique value:

[1] "Less than 10K" "Less than 1K"  "Less than 30K" "More than 60K"
[5] "Less than 60K" "Less than 100"

Expected outcome:

  SV_RNG         BF_WDL  AF_WDL
  <chr>           <int>   <int>
1 0                  NA  926339
2 Less than 100   15165   78106
3 Less than 1K   289663  588669
4 Less than 10K 1225950  558488
5 Less than 30K  848153  469356
6 Less than 60K  464382  354119
7 More than 60K 4531080 4399316

Two Step Solution:

# CREATE SEPARATE ENTITY_SET
es_01 <- df |> distinct(ID, .keep_all = TRUE) |> group_by(RNG_BEFORE_WDL)|> summarise(BF_WDL = n()) |> rename(SV_RNG = RNG_BEFORE_WDL)
es_02 <- df |> distinct(ID, .keep_all = TRUE) |> group_by(RNG_AFTER_WDL) |> summarise(AF_WDL = n()) |> rename(SV_RNG = RNG_AFTER_WDL)
  
# PRE-DEFINE CUSTOM SORT ORDER
ls_odr <- list("0", "Less than 100", "Less than 1K", "Less than 10K", "Less than 30K", "Less than 60K", "More than 60K") 

# SUMMARIZE ENTITY_SET
s_es00 <- es_01 |> full_join(es_02) |> arrange(`SV_RNG` |> factor(ls_odr))

How to do in a single step with R dplyr?

Thanks!

CodePudding user response:

If we need two datasets or keep it in a list, loop over the grouping column name as a string, and the the summarisation column

library(dplyr)
library(purrr)
lst1 <- map2(c("RNG_BEFORE_WDL", "RNG_AFTER_WD"), c("BF_WDL", "AF_WDL"),
     ~   df %>%
           distinct(ID, .keep_all = TRUE) %>%
           group_by(across(all_of(.x))) %>%
           summarise(!! .y := n(), .groups = 'drop') %>%
           rename(SV_RNG = 1)) %>%
     reduce(full_join)
  • Related