Home > Software design >  Use of tail() within groups confuses me a little bit
Use of tail() within groups confuses me a little bit

Time:07-05

I have 2 DFs that contains basically a Date, a Volume, and one more column. On one df that column is ticker, and on the other is Position.

> df
# A tibble: 54,885 × 3
   Position Date       Volume
      <dbl> <date>      <dbl>
 1        2 2003-01-02   1809
 2        1 2003-01-03   1831
 3        1 2003-01-06   2465
 4        1 2003-01-07   1215
 5        1 2003-01-08    955
 6        1 2003-01-09   1192
 7        1 2003-01-10   1901
 8        1 2003-01-13   2110
 9        1 2003-01-14   2521
10        1 2003-01-15   1704
# … with 54,875 more rows 
> df2
# A tibble: 154 × 3
   ticker date           volume
   <chr>  <date>          <dbl>
 1 X29L2  2022-06-01  34015836 
 2 X29L2  2022-06-02 255554864 
 3 X29L2  2022-06-03  52779492 
 4 X29L2  2022-06-06 846971456 
 5 X29L2  2022-06-07 433462592 
 6 X29L2  2022-06-08     31365.
 7 X29L2  2022-06-09   8364060.
 8 X29L2  2022-06-10   7550020 
 9 X29L2  2022-06-13  39714244 
10 X29L2  2022-06-14 821900608 
# … with 144 more rows

I want for each Position, and for each ticker, the last Dates available.

Let's start with df2:

> df2 %>% 
    group_by(ticker) %>%
    do(tail(., n=1))
# A tibble: 7 × 3
# Groups:   ticker [7]
  ticker date           volume
  <chr>  <date>          <dbl>
1 X16D2  2022-07-04 115893395.
2 X16G2  2022-07-04 434399604.
3 X17F3  2022-07-04 206883540.
4 X19Y3  2022-07-04 317255104.
5 X20E3  2022-07-04 291394381.
6 X21O2  2022-07-04 186407123.
7 X29L2  2022-07-04  69635266.

That works great. The range of Dates is

> range(df2$date)
[1] "2022-06-01" "2022-07-04"

and this is the correct output I was expecting.

Let's do the same with df:

> df %>%   
    group_by(Position) %>%
    do(tail(., n=1))
# A tibble: 20 × 3
# Groups:   Position [20]
   Position Date       Volume
      <dbl> <date>      <dbl>
 1        1 2021-12-30 163917
 2        2 2021-11-30 969631
 3        3 2021-10-29 153590
 4        4 2021-09-30  97777
 5        5 2021-08-31 188115
 6        6 2022-07-01   5277
 7        7 2022-06-30  24808
 8        8 2022-05-31  28236
 9        9 2022-04-29   1499
10       10 2022-03-31    197
11       11 2022-02-25    500
12       12 2022-01-31     NA
13       13 2021-12-30     NA
14       14 2015-09-30     NA
15       15 2010-10-29     NA
16       16 2010-09-30     NA
17       17 2010-08-31     NA
18       18 2010-07-30     NA
19       19 2010-03-31     NA
20       20 2009-12-30     NA

The range of Position is:

> range(df$Position)
[1]  1 20

But as it is shown, it retrieves different Dates and they're not the last ones. the range of Dates here is:

> range(df$Date)
[1] "2003-01-02" "2022-07-01"

And there is certainly data for Position 1 up to 12 as shown below:

> df %>% filter(Date == '2022-07-01')
# A tibble: 12 × 3
   Position Date       Volume
      <dbl> <date>      <dbl>
 1        7 2022-07-01   2052
 2        8 2022-07-01   2644
 3        9 2022-07-01    357
 4       10 2022-07-01    260
 5       11 2022-07-01    491
 6       12 2022-07-01    100
 7        1 2022-07-01 525635
 8        2 2022-07-01 107201
 9        3 2022-07-01  39664
10        4 2022-07-01  12479
11        5 2022-07-01  12568
12        6 2022-07-01   5277

Anyone can help me understand why they worked so differently and what can I do to get the same result as in df2?

Thanks!

CodePudding user response:

It may be a case of not arrangeing the data before calling the tail within do. Either we arrange the data by 'Position', 'Date' and then do a group by 'Position' with slice_tail

library(dplyr)
df %>% 
   arrange(Position, Date) %>%
   group_by(Position) %>% 
   slice_tail(n = 1) %>%
   # or may use do, but it can be slow
   # do(tail(., n = 1))
   ungroup   

Or directly use slice_max without arrangeing

df %>%
   group_by(Position) %>%
   slice_max(n =1, order_by = Date, with_ties = FALSE) %>%
   ungroup

Or may also use distinct after arrangeing in descending order as distinct returns the first unique value

df %>%
   arrange(Position, desc(Date)) %>%
   distinct(Position, .keep_all = TRUE)

The advantage with this method is it doesn't require grouping and then ungroup (to remove the group attribute)

  • Related