How to summarize data by-group, by creating dummy variables as the collapsing method-CodePudding

The bounty expires in 7 days. Answers to this question are eligible for a 50 reputation bounty. Emman wants to draw more attention to this question:

I'm looking for a speedy and efficient solution for the problem. While @JonSpring's answer works for the example data, it leads to error when applying on my real data. Specifically, tidyr::count() leads to the error of going beyond .Machine$integer.max == 2147483647. So I'll be happy with a different strategy that circumvents this issue.

I'm trying to summarize a dataset by groups, to have dummy columns for whether each group's values appear among the data's ungrouped most frequent values.

As an example, let's take flights data from nycflights13.

library(dplyr, warn.conflicts = FALSE)
library(nycflights13)

my_flights_raw <-
  flights %>%
  select(carrier, month, dest)

my_flights_raw
#> # A tibble: 336,776 x 3
#>    carrier month dest 
#>    <chr>   <int> <chr>
#>  1 UA          1 IAH  
#>  2 UA          1 IAH  
#>  3 AA          1 MIA  
#>  4 B6          1 BQN  
#>  5 DL          1 ATL  
#>  6 UA          1 ORD  
#>  7 B6          1 FLL  
#>  8 EV          1 IAD  
#>  9 B6          1 MCO  
#> 10 AA          1 ORD  
#> # ... with 336,766 more rows

My end-goal: I'm interested to know about each carrier in each month: whether it flew to the most popular destinations. I define "most popular" by the top-5 most frequent dest values in each month, then intersecting all months' top-5s.

step 1
I start by simple aggregation by months:

my_flights_agg <- 
  my_flights_raw %>%
  count(month, dest, name = "n_obs") %>%
  arrange(month, desc(n_obs)) 

my_flights_agg
#> # A tibble: 1,113 x 3
#>    month dest  n_obs
#>    <int> <chr> <int>
#>  1     1 ATL    1396
#>  2     1 ORD    1269
#>  3     1 BOS    1245
#>  4     1 MCO    1175
#>  5     1 FLL    1161
#>  6     1 LAX    1159
#>  7     1 CLT    1058
#>  8     1 MIA     981
#>  9     1 SFO     889
#> 10     1 DCA     865
#> # ... with 1,103 more rows

step 2
And now I'm going to cut the data to keep only the top 5 most popular per month.

my_flights_top_5_by_month <-
  my_flights_agg %>%
  group_by(month) %>%
  slice_max(order_by = n_obs, n = 5)

my_flights_top_5_by_month
#> # A tibble: 60 x 3
#> # Groups:   month [12]
#>    month dest  n_obs
#>    <int> <chr> <int>
#>  1     1 ATL    1396
#>  2     1 ORD    1269
#>  3     1 BOS    1245
#>  4     1 MCO    1175
#>  5     1 FLL    1161
#>  6     2 ATL    1267
#>  7     2 ORD    1197
#>  8     2 BOS    1182
#>  9     2 MCO    1110
#> 10     2 FLL    1073
#> # ... with 50 more rows

step 3
Now simply get the unique() of my_flights_top_5_by_month$dest:

my_flights_top_dest_across_months <- unique(my_flights_top_5_by_month$dest)

## [1] "ATL" "ORD" "BOS" "MCO" "FLL" "LAX" "SFO" "CLT"

Here's my question: given my_flights_top_dest_across_months, how can I summarize my_flights_raw to distinct carrier & month, such that the collapsing principle is whether each combination of carrier & month had flawn to each of the dest values in my_flights_top_dest_across_months?

desired output

##    carrier month ATL   ORD   BOS   MCO   FLL   LAX   SFO   CLT  
##    <chr>   <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
##  1 9E          1 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  2 9E          2 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  3 9E          3 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  4 9E          4 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  5 9E          5 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  6 9E          6 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  7 9E          7 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  8 9E          8 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
##  9 9E          9 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
## 10 9E         10 FALSE TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE 
## # ... with 175 more rows

I currently have the following code that is simply inefficient. It works fine for the example flights data, but is taking forever when applied on a large dataset (with several millions rows and groups). Any idea how the task described above can be done more efficiently?

# too slow :(
my_flights_raw %>%
  group_by(carrier, month) %>%
  summarise(destinations_vec = list(unique(dest))) %>%
  add_column(top_dest = list(my_flights_top_dest_across_month)) %>%
  mutate(are_top_dest_included = purrr::map2(.x = destinations_vec, .y = top_dest, .f = ~ .y %in% .x ), .keep = "unused") %>%
  mutate(across(are_top_dest_included, ~purrr::map(.x = ., .f = ~setNames(object = .x, nm = my_flights_top_dest_across_month))  )) %>%
  tidyr::unnest_wider(are_top_dest_included)

CodePudding user response：

It is quite possible that using the data.table library will be faster here. I will not argue. But I have mastered dplyr and would like to offer a pretty cool solution using the functions from this particular library.

First, let's prepare two little auxiliary functions. We will see how they work later.

library(nycflights13)
library(tidyverse)


ftopDest = function(data, ntop){
  data %>% 
    group_by(dest) %>% 
    summarise(ndest = n()) %>% 
    arrange(desc(ndest)) %>% 
    pull(dest) %>% .[1:ntop]
}

carrierToTopDest = function(data, topDest){
  data %>% mutate(carrierToToDest = dest %in% topDest)
}

Now you only need one simple mutation!

df = flights %>% nest_by(year, month) %>%  #Step 1
  mutate(topDest = list(ftopDest(data, 5)),  #Step 2
         data = list(carrierToTopDest(data, topDest)))  #Step 3

But let me describe step by step what is happening here.

In step one, let's nest the data into an internal tibble named data.

Output after Step 1

# A tibble: 12 x 3
# Rowwise:  year, month
    year month                data
   <int> <int> <list<tibble[,17]>>
 1  2013     1       [27,004 x 17]
 2  2013     2       [24,951 x 17]
 3  2013     3       [28,834 x 17]
 4  2013     4       [28,330 x 17]
 5  2013     5       [28,796 x 17]
 6  2013     6       [28,243 x 17]
 7  2013     7       [29,425 x 17]
 8  2013     8       [29,327 x 17]
 9  2013     9       [27,574 x 17]
10  2013    10       [28,889 x 17]
11  2013    11       [27,268 x 17]
12  2013    12       [28,135 x 17]

In step 2, we add the most popular flight destinations.

Output after step 2

# A tibble: 12 x 4
# Rowwise:  year, month
    year month                data topDest  
   <int> <int> <list<tibble[,17]>> <list>   
 1  2013     1       [27,004 x 17] <chr [5]>
 2  2013     2       [24,951 x 17] <chr [5]>
 3  2013     3       [28,834 x 17] <chr [5]>
 4  2013     4       [28,330 x 17] <chr [5]>
 5  2013     5       [28,796 x 17] <chr [5]>
 6  2013     6       [28,243 x 17] <chr [5]>
 7  2013     7       [29,425 x 17] <chr [5]>
 8  2013     8       [29,327 x 17] <chr [5]>
 9  2013     9       [27,574 x 17] <chr [5]>
10  2013    10       [28,889 x 17] <chr [5]>
11  2013    11       [27,268 x 17] <chr [5]>
12  2013    12       [28,135 x 17] <chr [5]>

In the last step, we add the carrierToToDest variable to the data variable, which determines whether the flight was going to one of the ntop places from the given month.

Output after step 3

# A tibble: 12 x 4
# Rowwise:  year, month
    year month data                   topDest  
   <int> <int> <list>                 <list>   
 1  2013     1 <tibble [27,004 x 18]> <chr [5]>
 2  2013     2 <tibble [24,951 x 18]> <chr [5]>
 3  2013     3 <tibble [28,834 x 18]> <chr [5]>
 4  2013     4 <tibble [28,330 x 18]> <chr [5]>
 5  2013     5 <tibble [28,796 x 18]> <chr [5]>
 6  2013     6 <tibble [28,243 x 18]> <chr [5]>
 7  2013     7 <tibble [29,425 x 18]> <chr [5]>
 8  2013     8 <tibble [29,327 x 18]> <chr [5]>
 9  2013     9 <tibble [27,574 x 18]> <chr [5]>
10  2013    10 <tibble [28,889 x 18]> <chr [5]>
11  2013    11 <tibble [27,268 x 18]> <chr [5]>
12  2013    12 <tibble [28,135 x 18]> <chr [5]>

How now we can see the most popular places. Let's do this:

df %>% mutate(topDest = paste(topDest, collapse = " "))

output

# A tibble: 12 x 4
# Rowwise:  year, month
    year month data                   topDest            
   <int> <int> <list>                 <chr>              
 1  2013     1 <tibble [27,004 x 18]> ATL ORD BOS MCO FLL
 2  2013     2 <tibble [24,951 x 18]> ATL ORD BOS MCO FLL
 3  2013     3 <tibble [28,834 x 18]> ATL ORD BOS MCO FLL
 4  2013     4 <tibble [28,330 x 18]> ATL ORD LAX BOS MCO
 5  2013     5 <tibble [28,796 x 18]> ORD ATL LAX BOS SFO
 6  2013     6 <tibble [28,243 x 18]> ORD ATL LAX BOS SFO
 7  2013     7 <tibble [29,425 x 18]> ORD ATL LAX BOS CLT
 8  2013     8 <tibble [29,327 x 18]> ORD ATL LAX BOS SFO
 9  2013     9 <tibble [27,574 x 18]> ORD LAX ATL BOS CLT
10  2013    10 <tibble [28,889 x 18]> ORD ATL LAX BOS CLT
11  2013    11 <tibble [27,268 x 18]> ATL ORD LAX BOS CLT
12  2013    12 <tibble [28,135 x 18]> ATL LAX MCO ORD CLT

Can we see flights to these destinations? Of course, it's not difficult.

df %>% select(-topDest) %>% 
  unnest(data) %>% 
  filter(carrierToToDest) %>% 
  select(year, month, flight, carrier, dest)

Output

# A tibble: 80,941 x 5
# Groups:   year, month [12]
    year month flight carrier dest 
   <int> <int>  <int> <chr>   <chr>
 1  2013     1    461 DL      ATL  
 2  2013     1   1696 UA      ORD  
 3  2013     1    507 B6      FLL  
 4  2013     1     79 B6      MCO  
 5  2013     1    301 AA      ORD  
 6  2013     1   1806 B6      BOS  
 7  2013     1    371 B6      FLL  
 8  2013     1   4650 MQ      ATL  
 9  2013     1   1743 DL      ATL  
10  2013     1   3768 MQ      ORD  
# ... with 80,931 more rows

This is my recipe. Very simple and transparent in my opinion. I would be extremely obligated if you would try it on your data and let me know with efficiency.

Small update

I just noticed that I wanted to group not only after year (although you don't mention it, it must be so), month, but also by the carrier variable. So let's add it as another grouping variable.

df = flights %>% nest_by(year, month, carrier) %>%  
  mutate(topDest = list(ftopDest(data, 5)),  
         data = list(carrierToTopDest(data, topDest)))

output

# A tibble: 185 x 5
# Rowwise:  year, month, carrier
    year month carrier data                  topDest  
   <int> <int> <chr>   <list>                <list>   
 1  2013     1 9E      <tibble [1,573 x 17]> <chr [5]>
 2  2013     1 AA      <tibble [2,794 x 17]> <chr [5]>
 3  2013     1 AS      <tibble [62 x 17]>    <chr [5]>
 4  2013     1 B6      <tibble [4,427 x 17]> <chr [5]>
 5  2013     1 DL      <tibble [3,690 x 17]> <chr [5]>
 6  2013     1 EV      <tibble [4,171 x 17]> <chr [5]>
 7  2013     1 F9      <tibble [59 x 17]>    <chr [5]>
 8  2013     1 FL      <tibble [328 x 17]>   <chr [5]>
 9  2013     1 HA      <tibble [31 x 17]>    <chr [5]>
10  2013     1 MQ      <tibble [2,271 x 17]> <chr [5]>
# ... with 175 more rows

Now let's get acquainted with the new top 5 directions.

df %>% mutate(topDest = paste(topDest, collapse = " "))

output

# A tibble: 185 x 5
# Rowwise:  year, month, carrier
    year month carrier data                  topDest            
   <int> <int> <chr>   <list>                <chr>              
 1  2013     1 9E      <tibble [1,573 x 17]> BOS PHL CVG MSP ORD
 2  2013     1 AA      <tibble [2,794 x 17]> DFW MIA ORD LAX BOS
 3  2013     1 AS      <tibble [62 x 17]>    SEA NA NA NA NA    
 4  2013     1 B6      <tibble [4,427 x 17]> FLL MCO BOS PBI SJU
 5  2013     1 DL      <tibble [3,690 x 17]> ATL DTW MCO FLL MIA
 6  2013     1 EV      <tibble [4,171 x 17]> IAD DTW DCA RDU CVG
 7  2013     1 F9      <tibble [59 x 17]>    DEN NA NA NA NA    
 8  2013     1 FL      <tibble [328 x 17]>   ATL CAK MKE NA NA  
 9  2013     1 HA      <tibble [31 x 17]>    HNL NA NA NA NA    
10  2013     1 MQ      <tibble [2,271 x 17]> RDU CMH ORD BNA ATL
# ... with 175 more rows

Summing up, I would like to add that the form is very clear for me. I can see the most popular df%>% mutate (topDest = paste (topDest, collapse =" ")) directions. I can filter all flights to the most popular destinations df%>% select (-topDest)%>% unnest (data)%>% filter (carrierToToDest)%>% select (year, month, flight, carrier, dest) and do any other transformations. I do not think that presenting the same information wider on over 100 variables is convenient for any analysis.

However, if you really need wider form, let me know. We'll do it this way.

CodePudding user response：

Does this do what you want? As far as I can tell it matches your output but has more rows because it includes all months for all carriers; carrier "OO" only has flights in 5 months and your version only shows those 5 months in the summary.

With the data as provided (336k rows), this takes a similar amount of time as your function, but it's faster as you deal with larger data. When I run these on data 100x as big after setting my_flights_raw <- my_flights_raw %>% tidyr::uncount(100), to make it 33M rows, the code below is about 40% faster.

Given the large number of groups you're dealing with, I expect this is a situation where data.table will really shine with better performance.

library(tidyverse)
my_flights_raw %>%
  count(carrier, month, dest) %>%
  complete(carrier, month, dest) %>%
  filter(dest %in% my_flights_top_dest_across_months) %>%
  mutate(n = if_else(!is.na(n), TRUE, FALSE)) %>%
  pivot_wider(names_from = dest, values_from = n)