I'm trying to summarize a dataset by groups, to have dummy columns for whether each group's values appear among the data's ungrouped most frequent values.
As an example, let's take flights
data from nycflights13
.
library(dplyr, warn.conflicts = FALSE)
library(nycflights13)
my_flights_raw <-
flights %>%
select(carrier, month, dest)
my_flights_raw
#> # A tibble: 336,776 x 3
#> carrier month dest
#> <chr> <int> <chr>
#> 1 UA 1 IAH
#> 2 UA 1 IAH
#> 3 AA 1 MIA
#> 4 B6 1 BQN
#> 5 DL 1 ATL
#> 6 UA 1 ORD
#> 7 B6 1 FLL
#> 8 EV 1 IAD
#> 9 B6 1 MCO
#> 10 AA 1 ORD
#> # ... with 336,766 more rows
My end-goal: I'm interested to know about each carrier
in each month
: whether it flew to the most popular destinations. I define "most popular" by the top-5 most frequent dest
values in each month, then intersecting all months' top-5s.
step 1
I start by simple aggregation by months:
my_flights_agg <-
my_flights_raw %>%
count(month, dest, name = "n_obs") %>%
arrange(month, desc(n_obs))
my_flights_agg
#> # A tibble: 1,113 x 3
#> month dest n_obs
#> <int> <chr> <int>
#> 1 1 ATL 1396
#> 2 1 ORD 1269
#> 3 1 BOS 1245
#> 4 1 MCO 1175
#> 5 1 FLL 1161
#> 6 1 LAX 1159
#> 7 1 CLT 1058
#> 8 1 MIA 981
#> 9 1 SFO 889
#> 10 1 DCA 865
#> # ... with 1,103 more rows
step 2
And now I'm going to cut the data to keep only the top 5 most popular per month.
my_flights_top_5_by_month <-
my_flights_agg %>%
group_by(month) %>%
slice_max(order_by = n_obs, n = 5)
my_flights_top_5_by_month
#> # A tibble: 60 x 3
#> # Groups: month [12]
#> month dest n_obs
#> <int> <chr> <int>
#> 1 1 ATL 1396
#> 2 1 ORD 1269
#> 3 1 BOS 1245
#> 4 1 MCO 1175
#> 5 1 FLL 1161
#> 6 2 ATL 1267
#> 7 2 ORD 1197
#> 8 2 BOS 1182
#> 9 2 MCO 1110
#> 10 2 FLL 1073
#> # ... with 50 more rows
step 3
Now simply get the unique()
of my_flights_top_5_by_month$dest
:
my_flights_top_dest_across_months <- unique(my_flights_top_5_by_month$dest)
## [1] "ATL" "ORD" "BOS" "MCO" "FLL" "LAX" "SFO" "CLT"
Here's my question: given my_flights_top_dest_across_months
, how can I summarize my_flights_raw
to distinct carrier
& month
, such that the collapsing principle is whether each combination of carrier
& month
had flawn to each of the dest
values in my_flights_top_dest_across_months
?
desired output
## carrier month ATL ORD BOS MCO FLL LAX SFO CLT
## <chr> <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 9E 1 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 2 9E 2 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 3 9E 3 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 4 9E 4 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 5 9E 5 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 6 9E 6 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 7 9E 7 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 8 9E 8 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 9 9E 9 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 10 9E 10 FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## # ... with 175 more rows
I currently have the following code that is simply inefficient. It works fine for the example flights
data, but is taking forever when applied on a large dataset (with several millions rows and groups). Any idea how the task described above can be done more efficiently?
# too slow :(
my_flights_raw %>%
group_by(carrier, month) %>%
summarise(destinations_vec = list(unique(dest))) %>%
add_column(top_dest = list(my_flights_top_dest_across_month)) %>%
mutate(are_top_dest_included = purrr::map2(.x = destinations_vec, .y = top_dest, .f = ~ .y %in% .x ), .keep = "unused") %>%
mutate(across(are_top_dest_included, ~purrr::map(.x = ., .f = ~setNames(object = .x, nm = my_flights_top_dest_across_month)) )) %>%
tidyr::unnest_wider(are_top_dest_included)
CodePudding user response:
It is quite possible that using the data.table
library will be faster here. I will not argue. But I have mastered dplyr
and would like to offer a pretty cool solution using the functions from this particular library.
First, let's prepare two little auxiliary functions. We will see how they work later.
library(nycflights13)
library(tidyverse)
ftopDest = function(data, ntop){
data %>%
group_by(dest) %>%
summarise(ndest = n()) %>%
arrange(desc(ndest)) %>%
pull(dest) %>% .[1:ntop]
}
carrierToTopDest = function(data, topDest){
data %>% mutate(carrierToToDest = dest %in% topDest)
}
Now you only need one simple mutation!
df = flights %>% nest_by(year, month) %>% #Step 1
mutate(topDest = list(ftopDest(data, 5)), #Step 2
data = list(carrierToTopDest(data, topDest))) #Step 3
But let me describe step by step what is happening here.
In step one, let's nest the data into an internal tibble
named data
.
Output after Step 1
# A tibble: 12 x 3
# Rowwise: year, month
year month data
<int> <int> <list<tibble[,17]>>
1 2013 1 [27,004 x 17]
2 2013 2 [24,951 x 17]
3 2013 3 [28,834 x 17]
4 2013 4 [28,330 x 17]
5 2013 5 [28,796 x 17]
6 2013 6 [28,243 x 17]
7 2013 7 [29,425 x 17]
8 2013 8 [29,327 x 17]
9 2013 9 [27,574 x 17]
10 2013 10 [28,889 x 17]
11 2013 11 [27,268 x 17]
12 2013 12 [28,135 x 17]
In step 2, we add the most popular flight destinations.
Output after step 2
# A tibble: 12 x 4
# Rowwise: year, month
year month data topDest
<int> <int> <list<tibble[,17]>> <list>
1 2013 1 [27,004 x 17] <chr [5]>
2 2013 2 [24,951 x 17] <chr [5]>
3 2013 3 [28,834 x 17] <chr [5]>
4 2013 4 [28,330 x 17] <chr [5]>
5 2013 5 [28,796 x 17] <chr [5]>
6 2013 6 [28,243 x 17] <chr [5]>
7 2013 7 [29,425 x 17] <chr [5]>
8 2013 8 [29,327 x 17] <chr [5]>
9 2013 9 [27,574 x 17] <chr [5]>
10 2013 10 [28,889 x 17] <chr [5]>
11 2013 11 [27,268 x 17] <chr [5]>
12 2013 12 [28,135 x 17] <chr [5]>
In the last step, we add the carrierToToDest
variable to the data
variable, which determines whether the flight was going to one of the ntop
places from the given month.
Output after step 3
# A tibble: 12 x 4
# Rowwise: year, month
year month data topDest
<int> <int> <list> <list>
1 2013 1 <tibble [27,004 x 18]> <chr [5]>
2 2013 2 <tibble [24,951 x 18]> <chr [5]>
3 2013 3 <tibble [28,834 x 18]> <chr [5]>
4 2013 4 <tibble [28,330 x 18]> <chr [5]>
5 2013 5 <tibble [28,796 x 18]> <chr [5]>
6 2013 6 <tibble [28,243 x 18]> <chr [5]>
7 2013 7 <tibble [29,425 x 18]> <chr [5]>
8 2013 8 <tibble [29,327 x 18]> <chr [5]>
9 2013 9 <tibble [27,574 x 18]> <chr [5]>
10 2013 10 <tibble [28,889 x 18]> <chr [5]>
11 2013 11 <tibble [27,268 x 18]> <chr [5]>
12 2013 12 <tibble [28,135 x 18]> <chr [5]>
How now we can see the most popular places. Let's do this:
df %>% mutate(topDest = paste(topDest, collapse = " "))
output
# A tibble: 12 x 4
# Rowwise: year, month
year month data topDest
<int> <int> <list> <chr>
1 2013 1 <tibble [27,004 x 18]> ATL ORD BOS MCO FLL
2 2013 2 <tibble [24,951 x 18]> ATL ORD BOS MCO FLL
3 2013 3 <tibble [28,834 x 18]> ATL ORD BOS MCO FLL
4 2013 4 <tibble [28,330 x 18]> ATL ORD LAX BOS MCO
5 2013 5 <tibble [28,796 x 18]> ORD ATL LAX BOS SFO
6 2013 6 <tibble [28,243 x 18]> ORD ATL LAX BOS SFO
7 2013 7 <tibble [29,425 x 18]> ORD ATL LAX BOS CLT
8 2013 8 <tibble [29,327 x 18]> ORD ATL LAX BOS SFO
9 2013 9 <tibble [27,574 x 18]> ORD LAX ATL BOS CLT
10 2013 10 <tibble [28,889 x 18]> ORD ATL LAX BOS CLT
11 2013 11 <tibble [27,268 x 18]> ATL ORD LAX BOS CLT
12 2013 12 <tibble [28,135 x 18]> ATL LAX MCO ORD CLT
Can we see flights to these destinations? Of course, it's not difficult.
df %>% select(-topDest) %>%
unnest(data) %>%
filter(carrierToToDest) %>%
select(year, month, flight, carrier, dest)
Output
# A tibble: 80,941 x 5
# Groups: year, month [12]
year month flight carrier dest
<int> <int> <int> <chr> <chr>
1 2013 1 461 DL ATL
2 2013 1 1696 UA ORD
3 2013 1 507 B6 FLL
4 2013 1 79 B6 MCO
5 2013 1 301 AA ORD
6 2013 1 1806 B6 BOS
7 2013 1 371 B6 FLL
8 2013 1 4650 MQ ATL
9 2013 1 1743 DL ATL
10 2013 1 3768 MQ ORD
# ... with 80,931 more rows
This is my recipe. Very simple and transparent in my opinion. I would be extremely obligated if you would try it on your data and let me know with efficiency.
Small update
I just noticed that I wanted to group not only after year
(although you don't mention it, it must be so), month
, but also by the carrier
variable. So let's add it as another grouping variable.
df = flights %>% nest_by(year, month, carrier) %>%
mutate(topDest = list(ftopDest(data, 5)),
data = list(carrierToTopDest(data, topDest)))
output
# A tibble: 185 x 5
# Rowwise: year, month, carrier
year month carrier data topDest
<int> <int> <chr> <list> <list>
1 2013 1 9E <tibble [1,573 x 17]> <chr [5]>
2 2013 1 AA <tibble [2,794 x 17]> <chr [5]>
3 2013 1 AS <tibble [62 x 17]> <chr [5]>
4 2013 1 B6 <tibble [4,427 x 17]> <chr [5]>
5 2013 1 DL <tibble [3,690 x 17]> <chr [5]>
6 2013 1 EV <tibble [4,171 x 17]> <chr [5]>
7 2013 1 F9 <tibble [59 x 17]> <chr [5]>
8 2013 1 FL <tibble [328 x 17]> <chr [5]>
9 2013 1 HA <tibble [31 x 17]> <chr [5]>
10 2013 1 MQ <tibble [2,271 x 17]> <chr [5]>
# ... with 175 more rows
Now let's get acquainted with the new top 5 directions.
df %>% mutate(topDest = paste(topDest, collapse = " "))
output
# A tibble: 185 x 5
# Rowwise: year, month, carrier
year month carrier data topDest
<int> <int> <chr> <list> <chr>
1 2013 1 9E <tibble [1,573 x 17]> BOS PHL CVG MSP ORD
2 2013 1 AA <tibble [2,794 x 17]> DFW MIA ORD LAX BOS
3 2013 1 AS <tibble [62 x 17]> SEA NA NA NA NA
4 2013 1 B6 <tibble [4,427 x 17]> FLL MCO BOS PBI SJU
5 2013 1 DL <tibble [3,690 x 17]> ATL DTW MCO FLL MIA
6 2013 1 EV <tibble [4,171 x 17]> IAD DTW DCA RDU CVG
7 2013 1 F9 <tibble [59 x 17]> DEN NA NA NA NA
8 2013 1 FL <tibble [328 x 17]> ATL CAK MKE NA NA
9 2013 1 HA <tibble [31 x 17]> HNL NA NA NA NA
10 2013 1 MQ <tibble [2,271 x 17]> RDU CMH ORD BNA ATL
# ... with 175 more rows
Summing up, I would like to add that the form is very clear for me. I can see the most popular df%>% mutate (topDest = paste (topDest, collapse =" "))
directions. I can filter all flights to the most popular destinations df%>% select (-topDest)%>% unnest (data)%>% filter (carrierToToDest)%>% select (year, month, flight, carrier, dest)
and do any other transformations. I do not think that presenting the same information wider on over 100 variables is convenient for any analysis.
However, if you really need wider form, let me know. We'll do it this way.
CodePudding user response:
Does this do what you want? As far as I can tell it matches your output but has more rows because it includes all months for all carriers; carrier
"OO" only has flights in 5 months and your version only shows those 5 months in the summary.
With the data as provided (336k rows), this takes a similar amount of time as your function, but it's faster as you deal with larger data. When I run these on data 100x as big after setting my_flights_raw <- my_flights_raw %>% tidyr::uncount(100)
, to make it 33M rows, the code below is about 40% faster.
Given the large number of groups you're dealing with, I expect this is a situation where data.table
will really shine with better performance.
library(tidyverse)
my_flights_raw %>%
count(carrier, month, dest) %>%
complete(carrier, month, dest) %>%
filter(dest %in% my_flights_top_dest_across_months) %>%
mutate(n = if_else(!is.na(n), TRUE, FALSE)) %>%
pivot_wider(names_from = dest, values_from = n)