I have 2 DFs that contains basically a Date, a Volume, and one more column. On one df that column is ticker, and on the other is Position.
> df
# A tibble: 54,885 × 3
Position Date Volume
<dbl> <date> <dbl>
1 2 2003-01-02 1809
2 1 2003-01-03 1831
3 1 2003-01-06 2465
4 1 2003-01-07 1215
5 1 2003-01-08 955
6 1 2003-01-09 1192
7 1 2003-01-10 1901
8 1 2003-01-13 2110
9 1 2003-01-14 2521
10 1 2003-01-15 1704
# … with 54,875 more rows
> df2
# A tibble: 154 × 3
ticker date volume
<chr> <date> <dbl>
1 X29L2 2022-06-01 34015836
2 X29L2 2022-06-02 255554864
3 X29L2 2022-06-03 52779492
4 X29L2 2022-06-06 846971456
5 X29L2 2022-06-07 433462592
6 X29L2 2022-06-08 31365.
7 X29L2 2022-06-09 8364060.
8 X29L2 2022-06-10 7550020
9 X29L2 2022-06-13 39714244
10 X29L2 2022-06-14 821900608
# … with 144 more rows
I want for each Position, and for each ticker, the last Dates available.
Let's start with df2:
> df2 %>%
group_by(ticker) %>%
do(tail(., n=1))
# A tibble: 7 × 3
# Groups: ticker [7]
ticker date volume
<chr> <date> <dbl>
1 X16D2 2022-07-04 115893395.
2 X16G2 2022-07-04 434399604.
3 X17F3 2022-07-04 206883540.
4 X19Y3 2022-07-04 317255104.
5 X20E3 2022-07-04 291394381.
6 X21O2 2022-07-04 186407123.
7 X29L2 2022-07-04 69635266.
That works great. The range of Dates is
> range(df2$date)
[1] "2022-06-01" "2022-07-04"
and this is the correct output I was expecting.
Let's do the same with df:
> df %>%
group_by(Position) %>%
do(tail(., n=1))
# A tibble: 20 × 3
# Groups: Position [20]
Position Date Volume
<dbl> <date> <dbl>
1 1 2021-12-30 163917
2 2 2021-11-30 969631
3 3 2021-10-29 153590
4 4 2021-09-30 97777
5 5 2021-08-31 188115
6 6 2022-07-01 5277
7 7 2022-06-30 24808
8 8 2022-05-31 28236
9 9 2022-04-29 1499
10 10 2022-03-31 197
11 11 2022-02-25 500
12 12 2022-01-31 NA
13 13 2021-12-30 NA
14 14 2015-09-30 NA
15 15 2010-10-29 NA
16 16 2010-09-30 NA
17 17 2010-08-31 NA
18 18 2010-07-30 NA
19 19 2010-03-31 NA
20 20 2009-12-30 NA
The range of Position is:
> range(df$Position)
[1] 1 20
But as it is shown, it retrieves different Dates and they're not the last ones. the range of Dates here is:
> range(df$Date)
[1] "2003-01-02" "2022-07-01"
And there is certainly data for Position 1 up to 12 as shown below:
> df %>% filter(Date == '2022-07-01')
# A tibble: 12 × 3
Position Date Volume
<dbl> <date> <dbl>
1 7 2022-07-01 2052
2 8 2022-07-01 2644
3 9 2022-07-01 357
4 10 2022-07-01 260
5 11 2022-07-01 491
6 12 2022-07-01 100
7 1 2022-07-01 525635
8 2 2022-07-01 107201
9 3 2022-07-01 39664
10 4 2022-07-01 12479
11 5 2022-07-01 12568
12 6 2022-07-01 5277
Anyone can help me understand why they worked so differently and what can I do to get the same result as in df2?
Thanks!
CodePudding user response:
It may be a case of not arrange
ing the data before calling the tail
within do
. Either we arrange
the data by 'Position', 'Date' and then do a group by 'Position' with slice_tail
library(dplyr)
df %>%
arrange(Position, Date) %>%
group_by(Position) %>%
slice_tail(n = 1) %>%
# or may use do, but it can be slow
# do(tail(., n = 1))
ungroup
Or directly use slice_max
without arrange
ing
df %>%
group_by(Position) %>%
slice_max(n =1, order_by = Date, with_ties = FALSE) %>%
ungroup
Or may also use distinct
after arrange
ing in desc
ending order as distinct
returns the first unique value
df %>%
arrange(Position, desc(Date)) %>%
distinct(Position, .keep_all = TRUE)
The advantage with this method is it doesn't require grouping and then ungroup
(to remove the group attribute)