Home > Blockchain >  Creating a subset made up all rows on the latest date
Creating a subset made up all rows on the latest date

Time:07-26

I've got a massive dataframe (~4million observations) of the following pattern:

Date        Type Value
2019-02-09  HUF  12
2019-02-09  HAD  226394
2019-02-09  WOK  27566
2019-02-09  STR  217098
2019-02-09  HUF  208463
2019-02-09  HAD  9320
2020-02-09  WOK  156607
2020-02-09  STR  19790
2020-02-09  HUF  24541
2020-02-09  HAD  1074419
2021-02-09  WOK  17250
2021-02-09  STR  12249
2021-02-09  HUF  43651
2021-02-09  HAD  45121

I'd like to create a subset of this dataframe which gathers together all rows that have the latest date in it.

I've attempted it using this code:

latest <- df %>% group_by(type) %>% filter(as.Date(Date) == max(as.Date(Date)))

However it doesn't seem to do anything. Does anyone have any advice on how to do this in particular for a very large dataframe?

CodePudding user response:

There are several problems:

  • the library statement is missing
  • Type is capitalized in the data but not in the code
  • ungroup is missing. If it is omitted it will still give what looks like a good answer but the problem is that it will retain the grouping information which could cause problems in further processing.
  • you don't actually need to convert to Date class although that would not prevent it from working
  • would be easier to use slice_max
  • the data should be shown in the question in reproducible form using dput as per the instructions at the top of the tag. I have done that for you in the Note at the end. It is possible that what you have differs from what we have in the Note but there is no way of knowing given the ambiguity in the question.

Incorporating the above with the input shown in the Note at the end we have the following

library(dplyr)
df %>% group_by(Type) %>% slice_max(Date) %>% ungroup

giving:

# A tibble: 4 × 3
  Date       Type  Value
  <chr>      <chr> <int>
1 2021-02-09 HAD   45121
2 2021-02-09 HUF   43651
3 2021-02-09 STR   12249
4 2021-02-09 WOK   17250

Note

df <- structure(list(Date = c("2019-02-09", "2019-02-09", "2019-02-09", 
"2019-02-09", "2019-02-09", "2019-02-09", "2020-02-09", "2020-02-09", 
"2020-02-09", "2020-02-09", "2021-02-09", "2021-02-09", "2021-02-09", 
"2021-02-09"), Type = c("HUF", "HAD", "WOK", "STR", "HUF", "HAD", 
"WOK", "STR", "HUF", "HAD", "WOK", "STR", "HUF", "HAD"), Value = c(12L, 
226394L, 27566L, 217098L, 208463L, 9320L, 156607L, 19790L, 24541L, 
1074419L, 17250L, 12249L, 43651L, 45121L)), class = "data.frame", row.names = c(NA, 
-14L))

We obtained the above by copying the data shown in the question to the clipboard and then running this adding the df<- manually.

df <- read.table("clipboard", header = TRUE)
dput(df)

CodePudding user response:

Your code is working but maybe you are trying to do something else here? Your code keep the most recent row for each group, not the most recent rows in the dataset. See below:

 library(tidyverse)

df <- data.frame("Date" = c("2019-02-09", "2019-02-09", "2019-02-09", "2019-01-09", "2019-01-09", "2019-02-09"),
             "type" = c("A", "B", "A", "C", "A", "A"),
             "value" = c(10,13,15,20,50,32))

# Just the most recent rows
 df %>% 
  # group_by(type) %>%
  filter(as.Date(Date) == max(as.Date(Date)))

# Most recent rows per group - for example, group A with date 2019-01-09 is dropped
 df %>% 
  group_by(type) %>%
  filter(as.Date(Date) == max(as.Date(Date)))

Notice that in the first one, we drop the group_by and get the most recent rows in the whole dataset. Yours with group_by gets the most recent per group. If you need JUST the most recent rows, drop the group_by

  • Related