I'm trying to summarize a data frame, while grouping by a variable. My problem is that when doing such summarizing procedure, I lose other columns that I need.
Consider the following data:
df <-
tibble::tribble(
~id, ~year, ~my_value,
1, 2010, 2,
1, 2013, 2,
1, 2014, 2,
2, 2010, 4,
2, 2012, 3,
2, 2014, 4,
2, 2015, 2,
3, 2015, 3,
3, 2010, 3,
3, 2011, 3
)
I want to group by id
in order to collapse my_value
to a single value. I use the following algorithm:
- IF all values of
my_value
are identical, then simply return the first value, i.e,my_value[1]
. - ELSE return the smallest value, i.e.,
min(my_value)
.
So I wrote a small function that does it:
my_func <- function(x) {
if (var(x) == 0) {
return(x[1])
}
# else:
min(x)
}
And now I can use either dplyr
or data.table
to summarize by id
:
library(dplyr)
library(data.table)
# dplyr
df %>%
group_by(id) %>%
summarise(my_min_val = my_func(my_value))
#> # A tibble: 3 x 2
#> id my_min_val
#> <dbl> <dbl>
#> 1 1 2
#> 2 2 2
#> 3 3 3
# data.table
setDT(df)[, .(my_min_val = my_func(my_value)), by = "id"]
#> id my_min_val
#> 1: 1 2
#> 2: 2 2
#> 3: 3 3
So far so good. My problem is that I lost the year
value. I want the respective year
value for each chosen my_value
.
My desired output should look like:
# desired output
desired_output <-
tribble(~id, ~my_min_val, ~year,
1, 2, 2010, # because for id 1, var(my_value) is 0, and hence my_value[1] corresponds to year 2010
2, 2, 2015, # because for id 2, var(my_value) is not 0, and hence min(my_value) (which is 2) corresponds to year 2015
3, 3, 2015) # because for id 3, var(my_value) is 0, hence my_value[1] corresponds to year 2015
I especially seek a data.table
solution because my real data is very large (over 1 million rows) and with many groups. Thus efficiency is important. Thanks!
CodePudding user response:
We may use the condition in slice
library(dplyr)
my_func <- function(x) if(var(x) == 0) 1 else which.min(x)
df %>%
group_by(id) %>%
slice(my_func(my_value)) %>%
ungroup
-output
# A tibble: 3 × 3
id year my_value
<dbl> <dbl> <dbl>
1 1 2010 2
2 2 2015 2
3 3 2015 3
Or using data.table
library(data.table)
setDT(df)[df[, .I[my_func(my_value)], id]$V1]
id year my_value
1: 1 2010 2
2: 2 2015 2
3: 3 2015 3
Or with slice_min
and with_ties = FALSE
df %>%
group_by(id) %>%
slice_min(n = 1, order_by = my_value, with_ties = FALSE) %>%
ungroup
-output
# A tibble: 3 × 3
id year my_value
<dbl> <dbl> <dbl>
1 1 2010 2
2 2 2015 2
3 3 2015 3