Having a data frame, I want to generate a new list-column containing named vector(s) (one vector per row). Each vector derives its names and values from 2 other data frame columns. But I'm stuck because I want to do it:
- by group
- as computationally-efficient as possible
Example
Let's take mpg
dataset from {ggplot2}
to illustrate the by group principle. I want to lump together pairs of cty
and hwy
values, grouped by distinct combinations of manufacturer
& year
. So we can do:
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
my_mpg <-
mpg %>%
select(manufacturer, year, cty, hwy)
via_tidyr_nest <-
my_mpg %>%
group_by(manufacturer, year) %>%
nest()
via_tidyr_nest
#> # A tibble: 30 x 3
#> # Groups: manufacturer, year [30]
#> manufacturer year data
#> <chr> <int> <list>
#> 1 audi 1999 <tibble [9 x 2]>
#> 2 audi 2008 <tibble [9 x 2]>
#> 3 chevrolet 2008 <tibble [12 x 2]>
#> 4 chevrolet 1999 <tibble [7 x 2]>
#> 5 dodge 1999 <tibble [16 x 2]>
#> 6 dodge 2008 <tibble [21 x 2]>
#> 7 ford 1999 <tibble [15 x 2]>
#> 8 ford 2008 <tibble [10 x 2]>
#> 9 honda 1999 <tibble [5 x 2]>
#> 10 honda 2008 <tibble [4 x 2]>
#> # ... with 20 more rows
Created on 2021-09-27 by the reprex package (v0.3.0)
This is perfect except that I don't want a nested tibble but a nested named vector. (the reason: once we store the output as an object in the environment, the named vector version is lighter in size than the nested tibble one).
The working but undesired solution would take via_tidyr_nest
and convert the nested tibble into a named vector.
expected_output <-
via_tidyr_nest %>%
mutate(desired_named_vec = map(.x = data, .f = ~pull(.x, cty, hwy))) %>%
select(-data)
expected_output
#> # A tibble: 30 x 3
#> # Groups: manufacturer, year [30]
#> manufacturer year desired_named_vec
#> <chr> <int> <list>
#> 1 audi 1999 <int [9]>
#> 2 audi 2008 <int [9]>
#> 3 chevrolet 2008 <int [12]>
#> 4 chevrolet 1999 <int [7]>
#> 5 dodge 1999 <int [16]>
#> 6 dodge 2008 <int [21]>
#> 7 ford 1999 <int [15]>
#> 8 ford 2008 <int [10]>
#> 9 honda 1999 <int [5]>
#> 10 honda 2008 <int [4]>
#> # ... with 20 more rows
This is undesired because it achieves the desired output via a detour. First it creates a tibble and then it converts to a named vector. While processing time is negligible in this example, in reality I have a large dataset (10 million rows). Thus, adding any extra step is costly. Instead, I wish to arrive at expected_output
with as fewest steps as possible.
One unsuccessful attempt:
library(purrr)
via_summarise_map2_setnames <-
my_mpg %>%
group_by(manufacturer, year) %>%
summarise(named_vec = map2(.x = cty, .y = hwy, .f = ~setNames(.x, .y)))
#> `summarise()` has grouped output by 'manufacturer', 'year'. You can override using the `.groups` argument.
via_summarise_map2_setnames
#> # A tibble: 234 x 3
#> # Groups: manufacturer, year [30]
#> manufacturer year named_vec
#> <chr> <int> <list>
#> 1 audi 1999 <int [1]>
#> 2 audi 1999 <int [1]>
#> 3 audi 1999 <int [1]>
#> 4 audi 1999 <int [1]>
#> 5 audi 1999 <int [1]>
#> 6 audi 1999 <int [1]>
#> 7 audi 1999 <int [1]>
#> 8 audi 1999 <int [1]>
#> 9 audi 1999 <int [1]>
#> 10 audi 2008 <int [1]>
#> # ... with 224 more rows
Any idea how to go from my_mpg
to expected_output
directly, without creating a tibble in-between?
EDIT
Just a general thought in the context of this question. I understand that the default behavior of tidyr::nest()
is to return a nested tibble. But I didn't find any discussion of this decision. In other words, what if we wanted to choose ourselves the class of the nested data? It could be either a tibble
as default, or otherwise a data.frame
, data.table
, named vector
, etc. Whatever the user chooses as the output class.
EDIT 2
If anyone can think of a data.table
solution, it would be very helpful.
CodePudding user response:
dplyr::group_modify()
and tibble::deframe()
can be used here. Instead of deframe()
the pull(x, cty, hwy)
from your question would work just the same.
library(tidyverse)
mpg |>
select(manufacturer, year, cty, hwy) |>
group_by(manufacturer, year) |>
group_modify(\(x, ...) tibble(res = list(deframe(x))))
#> # A tibble: 30 × 3
#> # Groups: manufacturer, year [30]
#> manufacturer year res
#> <chr> <int> <list>
#> 1 audi 1999 <int [9]>
#> 2 audi 2008 <int [9]>
#> 3 chevrolet 1999 <int [7]>
#> 4 chevrolet 2008 <int [12]>
#> 5 dodge 1999 <int [16]>
#> 6 dodge 2008 <int [21]>
#> 7 ford 1999 <int [15]>
#> 8 ford 2008 <int [10]>
#> 9 honda 1999 <int [5]>
#> 10 honda 2008 <int [4]>
#> # … with 20 more rows
CodePudding user response:
edited : replaced 'map' by 'Map'
I hope this can be useful. Your solution is within 'f', my proposal within 'g'. It uses the index created by dplyr's 'group_by' to collect the needed data to build the named vectors.
f <- function () {
via_tidyr_nest <-
my_mpg %>%
group_by(manufacturer, year) %>%
nest()
expected_output <-
via_tidyr_nest %>%
mutate(desired_named_vec = map(.x = data, .f = ~pull(.x, cty, hwy))) %>%
select(-data)
}
g <- function () {
df1 <- my_mpg %>% group_by(manufacturer, year)
df2 <- attr(df1,"groups")
Map(function(rows) {
r <- df1[rows,"cty",drop=TRUE]
setNames(r,df1[rows,"hwy",drop=TRUE])
},
df2$.rows
) -> l
df <- data.frame(manufacturer=df2$manufacturer,year=df2$year,named_vector=I(l))
}
# other solutions
h <- function () {
hdf <- my_mpg %>%
group_by(manufacturer, year) %>%
summarise(named_vec = map2(list(cty), list(hwy), ~set_names(.x, .y)))
}
k <- function() {
mpg |>
select(manufacturer, year, cty, hwy) |>
group_by(manufacturer, year) |>
group_modify(\(x, ...) tibble(res = list(deframe(x))))
}
library(microbenchmark)
microbenchmark(OP=f(),Nicolas2=g(),Rui=h(),Till=k())
Unit: milliseconds
expr min lq mean median uq max neval
OP 21.8917 22.64035 24.389126 23.28235 24.70075 39.9593 100
Nicolas2 3.0507 3.15920 3.481469 3.24625 3.57840 7.3173 100
Rui 6.5460 6.75300 7.505564 7.16255 7.64390 12.0359 100
Till 31.2364 32.31115 34.940356 32.92990 36.11505 107.2709 100
CodePudding user response:
Here is a way. Coerce cty
and hwy
to "list"
before setting the names. It seems to work.
library(purrr)
library(dplyr)
data(mpg, package = "ggplot2")
my_mpg <-
mpg %>%
select(manufacturer, year, cty, hwy)
my_mpg %>%
group_by(manufacturer, year) %>%
summarise(named_vec = map2(list(cty), list(hwy), ~set_names(.x, .y)))
#`summarise()` has grouped output by 'manufacturer'. You can override using the `.groups` argument.
## A tibble: 30 x 3
## Groups: manufacturer [15]
# manufacturer year named_vec
# <chr> <int> <list>
# 1 audi 1999 <int [9]>
# 2 audi 2008 <int [9]>
# 3 chevrolet 1999 <int [7]>
# 4 chevrolet 2008 <int [12]>
# 5 dodge 1999 <int [16]>
# 6 dodge 2008 <int [21]>
# 7 ford 1999 <int [15]>
# 8 ford 2008 <int [10]>
# 9 honda 1999 <int [5]>
#10 honda 2008 <int [4]>
## … with 20 more rows
Benchmarks
Since the question is a performance question, here are the benchmarks of the 4 proposed solutions, so far, the question's, Nicolas2's, Till's and mine above.
f <- function(X) {
X %>%
group_by(manufacturer, year) %>%
nest() %>%
mutate(desired_named_vec = map(.x = data, .f = ~pull(.x, cty, hwy))) %>%
select(-data)
}
g <- function(X) {
df1 <- X %>% group_by(manufacturer, year)
df2 <- attr(df1,"groups")
Map(function(rows) {
r <- df1[rows,"cty",drop=TRUE]
setNames(r,df1[rows,"hwy",drop=TRUE])
},
df2$.rows
) -> l
data.frame(manufacturer=df2$manufacturer,year=df2$year,named_vector=I(l))
}
h <- function(X){
X %>%
group_by(manufacturer, year) %>%
summarise(named_vec = map2(list(cty), list(hwy), ~set_names(.x, .y)), .groups = "drop")
}
i <- function(X){
X |>
select(manufacturer, year, cty, hwy) |>
group_by(manufacturer, year) |>
group_modify(\(x, ...) tibble(res = list(deframe(x))))
}
mb <- microbenchmark(
Emman = f(my_mpg),
Nicolas2 = g(my_mpg),
Rui = h(my_mpg),
Till = i(my_mpg)
)
print(mb, unit = "relative", order = "median")
#Unit: relative
# expr min lq mean median uq max neval cld
# Rui 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100 a
# Nicolas2 1.527957 1.468524 1.478286 1.482185 1.471565 1.724004 100 b
# Emman 4.504185 4.230921 4.215643 4.234087 4.148188 4.170934 100 c
# Till 6.264028 5.813678 5.883107 5.810876 5.744080 5.666524 100 d