Home > database >  Identify the most recent values by both group and date in R
Identify the most recent values by both group and date in R

Time:05-26

current df:

tibble(
  period = c("2010END", "2011END", 
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END",
             "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
             "2011END","2012END"),
  date = c('31-12-2010','31-12-2011', '30-04-2010','31-07-2010','30-09-2010','30-11-2010', '31-12-2010',
           '30-04-2011','31-07-2011','30-09-2011','30-11-2011', '31-12-2011', 
           '31-12-2011', '31-12-2012'),
  website = c(
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "youtube",
    "youtube"
  ),
  values = c(1, 2, 1, 2, 3, NA, 5, NA, NA, NA, NA, 10, 20, NA)
)

How can i go about creating a column which identifies the most recent by date non-na values for the grouping of both period and website, hence

So the final output would look like this:

tibble(
  period = c("2010END", "2011END", 
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END",
             "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
             "2011END","2012END"),
  date = c('31-12-2010','31-12-2011', '30-04-2010','31-07-2010','30-09-2010','30-11-2010', '31-12-2010',
           '30-04-2011','31-07-2011','30-09-2011','30-11-2011', '31-12-2011', 
           '31-12-2011', '31-12-2012'),
  website = c(
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "youtube",
    "youtube"
  ),
  values = c(1, 2, 1, 2, 3, NA, 5, NA, NA, NA, NA, 10, 20, NA), 
  most_recent = c('no','yes', 'no', 'no', 'no', 'no', 'no','yes','yes','yes','yes','yes','yes','no')
)

I am trying to identify that when the first non-na value occurs for the group of period and website when sorted by the most recent date, then mark all the values for this period and website as "yes" in the most_recent column

So what you have is the following:

  • google 2011END is the latest value by date so is a yes - the value for 2010END is older so not selected
  • facebook 2011q1 to 2011END will have a yes value since there is one non-na value being 2011END and this is the most recent date and has a non-na value
  • youtube 2011END is yes - since it is the first non-na value we occur when we sort by date as 2012 has no value it is a no value

CodePudding user response:

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

data <- tibble(
  period = c(
    "2010END", "2011END",
    "2010Q1", "2010Q2", "2010Q3", "2010Q4", "2010END",
    "2011Q1", "2011Q2", "2011Q3", "2011Q4", "2011END",
    "2011END", "2012END"
  ),
  date = c(
    "31-12-2010", "31-12-2011", "30-04-2010", "31-07-2010", "30-09-2010", "30-11-2010", "31-12-2010",
    "30-04-2011", "31-07-2011", "30-09-2011", "30-11-2011", "31-12-2011",
    "31-12-2011", "31-12-2012"
  ),
  website = c(
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "youtube",
    "youtube"
  ),
  values = c(1, 2, 1, 2, 3, NA, 5, NA, NA, NA, NA, 10, 20, NA)
)

# group only by website
data %>%
  mutate(date = date %>% parse_date(format = "%d-%m-%Y")) %>%
  group_by(website) %>%
  mutate(most_recent = ifelse(date == min(date, na.rm = TRUE), "yes", "no"))
#> # A tibble: 14 × 5
#> # Groups:   website [3]
#>    period  date       website  values most_recent
#>    <chr>   <date>     <chr>     <dbl> <chr>      
#>  1 2010END 2010-12-31 google        1 yes        
#>  2 2011END 2011-12-31 google        2 no         
#>  3 2010Q1  2010-04-30 facebook      1 yes        
#>  4 2010Q2  2010-07-31 facebook      2 no         
#>  5 2010Q3  2010-09-30 facebook      3 no         
#>  6 2010Q4  2010-11-30 facebook     NA no         
#>  7 2010END 2010-12-31 facebook      5 no         
#>  8 2011Q1  2011-04-30 facebook     NA no         
#>  9 2011Q2  2011-07-31 facebook     NA no         
#> 10 2011Q3  2011-09-30 facebook     NA no         
#> 11 2011Q4  2011-11-30 facebook     NA no         
#> 12 2011END 2011-12-31 facebook     10 no         
#> 13 2011END 2011-12-31 youtube      20 yes        
#> 14 2012END 2012-12-31 youtube      NA no

# group by period and website
data %>%
  mutate(date = date %>% parse_date(format = "%d-%m-%Y")) %>%
  group_by(period, website) %>%
  mutate(most_recent = ifelse(date == min(date, na.rm = TRUE), "yes", "no"))
#> # A tibble: 14 × 5
#> # Groups:   period, website [14]
#>    period  date       website  values most_recent
#>    <chr>   <date>     <chr>     <dbl> <chr>      
#>  1 2010END 2010-12-31 google        1 yes        
#>  2 2011END 2011-12-31 google        2 yes        
#>  3 2010Q1  2010-04-30 facebook      1 yes        
#>  4 2010Q2  2010-07-31 facebook      2 yes        
#>  5 2010Q3  2010-09-30 facebook      3 yes        
#>  6 2010Q4  2010-11-30 facebook     NA yes        
#>  7 2010END 2010-12-31 facebook      5 yes        
#>  8 2011Q1  2011-04-30 facebook     NA yes        
#>  9 2011Q2  2011-07-31 facebook     NA yes        
#> 10 2011Q3  2011-09-30 facebook     NA yes        
#> 11 2011Q4  2011-11-30 facebook     NA yes        
#> 12 2011END 2011-12-31 facebook     10 yes        
#> 13 2011END 2011-12-31 youtube      20 yes        
#> 14 2012END 2012-12-31 youtube      NA yes

Created on 2022-05-25 by the reprex package (v2.0.0)

  • Related