Home > database >  How to take the most recent row of data in R?
How to take the most recent row of data in R?

Time:05-26

If i have this dataframe:

tibble(
  period = c("2010END", "2011END", 
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END",
             "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
             "2011END","2012END"),
  date = c('31-12-2010','31-12-2011', '30-04-2010','31-07-2010','30-09-2010','30-11-2010', '31-12-2010',
           '30-04-2011','31-07-2011','30-09-2011','30-11-2011', '31-12-2011', 
           '31-12-2011', '31-12-2012'),
  website = c(
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "youtube",
    "youtube"
  ),
  values = c(1, 2, 1, 2, 3, NA, 5, NA, NA, NA, NA, 10, 20, NA)
)

How can i go about creating a column which identifies what the most recent non NA rows are for that group of period and website?

So the final output would look like this:

tibble(
  period = c("2010END", "2011END", 
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END",
             "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
             "2011END","2012END"),
  date = c('31-12-2010','31-12-2011', '30-04-2010','31-07-2010','30-09-2010','30-11-2010', '31-12-2010',
           '30-04-2011','31-07-2011','30-09-2011','30-11-2011', '31-12-2011', 
           '31-12-2011', '31-12-2012'),
  website = c(
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "youtube",
    "youtube"
  ),
  values = c(1, 2, 1, 2, 3, NA, 5, NA, NA, NA, NA, 10, 20, NA), 
  most_recent = c('no','yes', 'no', 'no', 'no', 'no', 'no','yes','yes','yes','yes','yes','yes','no')
)

I am trying to identify that when the first non-na value occurs for the group of period and website when sorted by the most recent date, then mark all the values for this period and website as "yes" in the most_recent column

So what you have is the following:

  • google 2011END is the latest value by date so is a yes
  • facebook 2011q1 to 2011END will have a yes value since there is one non-na value being 2011END and this is the most recent date and has a non-na value
  • youtube 2011END is yes - since it is the first non-na value we occur when we sort by date as 2012 has no value it is a no value

I know it involves a group by but unsure where to go from there

This has been updated for clarity

CodePudding user response:

The code below selects the most recent non NA row by website.
As this isn't fully your expected result, as suggested in comments please clarify your question if necessary.

data[,most_recent:=fifelse(!is.na(values)&date==.SD[!is.na(values),max(date)],'yes','no'),by=website][]

    period       date  website values most_recent
 1: 2010END 31-12-2010   google      1          no
 2: 2011END 31-12-2011   google      2         yes
 3:  2010Q1 30-04-2010 facebook      1          no
 4:  2010Q2 31-07-2010 facebook      2          no
 5:  2010Q3 30-09-2010 facebook      3          no
 6:  2010Q4 30-11-2010 facebook     NA          no
 7: 2010END 31-12-2010 facebook      5          no
 8:  2011Q1 30-04-2011 facebook     NA          no
 9:  2011Q2 31-07-2011 facebook     NA          no
10:  2011Q3 30-09-2011 facebook     NA          no
11:  2011Q4 30-11-2011 facebook     NA          no
12: 2011END 31-12-2011 facebook     10         yes
13: 2011END 31-12-2011  youtube     20         yes
14: 2012END 31-12-2012  youtube     NA          no
  • Related