When does dplyr return ties when using slice_min and slice_max? I'm seeing some inconsistencies and can't find any clarification online or in the documentation.
Examples:
library(dplyr)
#there is a tie but only returns 5 rows, not the bottom 5 mpg's
mtcars %>% slice_min(mpg, n = 5, with_ties = TRUE)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
#> Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
#> Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
#> Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
#> Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
#this will return the top two as a tie when above it did not
mtcars %>%
slice_min(mpg, n = 1, with_ties = TRUE)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
#> Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
#another example of it using ties to return more than 3 rows
starwars %>%
select(gender, mass) %>%
group_by(gender) %>%
slice_min(mass, n = 3, with_ties = TRUE)
# A tibble: 8 x 2
# Groups: gender [3]
# gender mass
#
#1 feminine 45
#2 feminine 49
#3 feminine 50
#4 feminine 50
#5 masculine 15
#6 masculine 17
#7 masculine 20
#8 NA 48
Am I missing something here?
CodePudding user response:
The "tie" refers only to the borderline entry, not to ties anywhere in the data. If the last row that makes the cut is tied with rows that would otherwise be excluded, with_ties = TRUE pulls those rows into the output as well.
my_data <- data.frame(a = c(1, 1, 2, 2))
> slice_min(my_data, a, n = 1)
a
1 1
2 1
> slice_min(my_data, a, n = 2)
a
1 1
2 1
> slice_min(my_data, a, n = 3)
a
1 1
2 1
3 2
4 2
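The same check explains the mtcars results in the question. As a rough sketch (the variable name sorted_mpg is only for illustration), compare the value at the cutoff with the next one:

```r
# Sorted mpg values; position n is the cutoff for slice_min(mpg, n = n).
sorted_mpg <- sort(mtcars$mpg)

# n = 1: the 1st and 2nd lowest values are equal, so the tie is pulled in.
sorted_mpg[1] == sorted_mpg[2]   # TRUE

# n = 5: the 5th and 6th lowest values differ, so exactly 5 rows come back.
sorted_mpg[5] == sorted_mpg[6]   # FALSE
```

The tie between the two 10.4 rows sits inside the first five rows, so it has no effect on where the n = 5 cut falls.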
If you want the three lowest mpgs, you could start with a list of distinct mpgs, slice those, and join to original data:
mtcars %>%
distinct(mpg) %>%
slice_min(mpg, n = 3) %>%
left_join(mtcars)
Joining, by = "mpg"
mpg cyl disp hp drat wt qsec vs am gear carb
1 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
2 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
3 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
4 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
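Note that the join drops the car names, since they are row names rather than a column. One possible variation, assuming you want to keep them, moves the row names into a column first (the names mtcars_named, lowest3 and car are my own choices) and passes by = "mpg" explicitly so the join message is not printed:

```r
library(dplyr)

# Move row names into a regular column so they survive the join.
mtcars_named <- tibble::rownames_to_column(mtcars, var = "car")

lowest3 <- mtcars_named %>%
  distinct(mpg) %>%                    # one row per distinct mpg value
  slice_min(mpg, n = 3) %>%            # the three lowest distinct values
  left_join(mtcars_named, by = "mpg")  # every car with one of those values

lowest3[, c("car", "mpg")]
```

This still returns four rows, because two cars share the lowest mpg.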
CodePudding user response:
The documentation for slice_min/slice_max says:
with_ties
Should ties be kept together? The default, TRUE, may return more rows than you request. Use FALSE to ignore ties, and return the first n rows.
This means that when the number of rows you ask for is smaller than the number of rows tied at the cutoff value, you will get more rows than you requested.
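For instance, here is a small sketch on mtcars, where the minimum mpg is shared by two cars (the names tied and untied are only illustrative):

```r
library(dplyr)

# Default (with_ties = TRUE): both cars tied at the minimum mpg are returned.
tied <- mtcars %>% slice_min(mpg, n = 1)

# with_ties = FALSE: ties are ignored and only the first n rows are returned.
untied <- mtcars %>% slice_min(mpg, n = 1, with_ties = FALSE)

nrow(tied)     # 2
nrow(untied)   # 1
```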
CodePudding user response:
The behaviour of slice_min/slice_max can be surprising when the column holds only a single distinct value. Every row is then tied at the cutoff, so even if the data has 10000 rows, asking for n = 1 returns all of them, because with_ties defaults to TRUE:
dat <- tibble(a = rep(1, 5))
> slice_min(dat, a, n = 1)
# A tibble: 5 × 1
a
<dbl>
1 1
2 1
3 1
4 1
5 1
> slice_min(dat, a, n = 1, with_ties = TRUE)
# A tibble: 5 × 1
a
<dbl>
1 1
2 1
3 1
4 1
5 1
If there are duplicate values, an option is to arrange and use slice:
mtcars %>%
arrange(desc(mpg)) %>%
slice(1)
We can also get the output with a single filter:
mtcars %>% filter(mpg %in% tail(unique(sort(mpg)), 3))
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
> mtcars %>% filter(mpg %in% head(unique(sort(mpg)), 3))
mpg cyl disp hp drat wt qsec vs am gear carb
Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4