I've got a massive dataframe (~4million observations) of the following pattern:
Date Type Value
2019-02-09 HUF 12
2019-02-09 HAD 226394
2019-02-09 WOK 27566
2019-02-09 STR 217098
2019-02-09 HUF 208463
2019-02-09 HAD 9320
2020-02-09 WOK 156607
2020-02-09 STR 19790
2020-02-09 HUF 24541
2020-02-09 HAD 1074419
2021-02-09 WOK 17250
2021-02-09 STR 12249
2021-02-09 HUF 43651
2021-02-09 HAD 45121
I'd like to create a subset of this dataframe which gathers together all rows that have the latest date in it.
I've attempted it using this code:
latest <- df %>% group_by(type) %>% filter(as.Date(Date) == max(as.Date(Date)))
However it doesn't seem to do anything. Does anyone have any advice on how to do this in particular for a very large dataframe?
CodePudding user response:
There are several problems:
- the
library
statement is missing Type
is capitalized in the data but not in the codeungroup
is missing. If it is omitted it will still give what looks like a good answer but the problem is that it will retain the grouping information which could cause problems in further processing.- you don't actually need to convert to Date class although that would not prevent it from working
- would be easier to use
slice_max
- the data should be shown in the question in reproducible form using
dput
as per the instructions at the top of the r tag. I have done that for you in the Note at the end. It is possible that what you have differs from what we have in the Note but there is no way of knowing given the ambiguity in the question.
Incorporating the above with the input shown in the Note at the end we have the following
library(dplyr)
df %>% group_by(Type) %>% slice_max(Date) %>% ungroup
giving:
# A tibble: 4 × 3
Date Type Value
<chr> <chr> <int>
1 2021-02-09 HAD 45121
2 2021-02-09 HUF 43651
3 2021-02-09 STR 12249
4 2021-02-09 WOK 17250
Note
df <- structure(list(Date = c("2019-02-09", "2019-02-09", "2019-02-09",
"2019-02-09", "2019-02-09", "2019-02-09", "2020-02-09", "2020-02-09",
"2020-02-09", "2020-02-09", "2021-02-09", "2021-02-09", "2021-02-09",
"2021-02-09"), Type = c("HUF", "HAD", "WOK", "STR", "HUF", "HAD",
"WOK", "STR", "HUF", "HAD", "WOK", "STR", "HUF", "HAD"), Value = c(12L,
226394L, 27566L, 217098L, 208463L, 9320L, 156607L, 19790L, 24541L,
1074419L, 17250L, 12249L, 43651L, 45121L)), class = "data.frame", row.names = c(NA,
-14L))
We obtained the above by copying the data shown in the question to the clipboard and then running this adding the df<-
manually.
df <- read.table("clipboard", header = TRUE)
dput(df)
CodePudding user response:
Your code is working but maybe you are trying to do something else here? Your code keep the most recent row for each group, not the most recent rows in the dataset. See below:
library(tidyverse)
df <- data.frame("Date" = c("2019-02-09", "2019-02-09", "2019-02-09", "2019-01-09", "2019-01-09", "2019-02-09"),
"type" = c("A", "B", "A", "C", "A", "A"),
"value" = c(10,13,15,20,50,32))
# Just the most recent rows
df %>%
# group_by(type) %>%
filter(as.Date(Date) == max(as.Date(Date)))
# Most recent rows per group - for example, group A with date 2019-01-09 is dropped
df %>%
group_by(type) %>%
filter(as.Date(Date) == max(as.Date(Date)))
Notice that in the first one, we drop the group_by
and get the most recent rows in the whole dataset. Yours with group_by
gets the most recent per group. If you need JUST the most recent rows, drop the group_by