Home > Software design >  Get the latest non-NA value based on date column by group
Get the latest non-NA value based on date column by group

Time:05-12

I have a dataframe having country_name, date and several columns: column_1, column_2and column_3. I am trying to extract the latest record based on date across several columns.

The dataframe looks like this:

| country_name | date        | column_1| column_2| column_3|
| US           | 2016-11-02  | 7.5     | NA      | NA      |
| US           | 2017-09-12  | NA      | NA      | 9       |
| US           | 2017-09-19  | NA      | 8       | 10      |
| US           | 2020-02-10  | 10      | NA      | NA      |
| US           | 2021-03-10  | NA      | NA      | 7.3     |
| US           | 2021-05-02  | NA      | 3       | NA      |
| UK           | 2016-11-02  | NA      | 2       | NA      |
| UK           | 2017-09-12  | 0.5     | 3       | NA      |
 .
 .

For the US the desired output is:

| country_name | column_1| column_2| column_3|
| US           | 10      | 3       | 7.3     |

For column_1, the value with the latest date is 10 (date: 2020-02-10), for column_2 is 3 (date: 2021-05-02), and for column_3 is 7.3 (date: 2021-03-10). My goal is to apply this logic across several countries. How do I achieve this?

CodePudding user response:

library(dplyr)
library(tidyr)

df1 %>% 
  mutate(date = as.Date(date)) %>% 
  group_by(country_name) %>%
  arrange(date) %>%
  select(-date) %>% 
  fill(everything()) %>% 
  slice(n())

#> # A tibble: 2 x 4
#> # Groups:   country_name [2]
#>   country_name column_1 column_2 column_3
#>   <chr>           <dbl>    <int>    <dbl>
#> 1 UK                0.5        3     NA  
#> 2 US               10          3      7.3

Data:

read.table(text = "country_name  date         column_1 column_2 column_3
                   US            2016-11-02   7.5      NA       NA      
                   US            2017-09-12   NA       NA       9       
                   US            2017-09-19   NA       8        10      
                   US            2020-02-10   10       NA       NA      
                   US            2021-03-10   NA       NA       7.3     
                   US            2021-05-02   NA       3        NA      
                   UK            2016-11-02   NA       2        NA      
                   UK            2017-09-12   0.5      3        NA", 
           header = T, stringsAsFactors = F) -> df1

CodePudding user response:

Here is how we could do it:

  1. If necessary transform date column to date class with ymd() function from lubridate.
  2. group by country_name
  3. Now comes the trick we use across for col1 col2... etc. and collapse in reverse with paste(rev(.).... to get the last value to first place. This is important for the next step.
  4. Use parse_number() from readr package that will extract the first number!
library(dplyr)
library(lubridate)
library(readr)

df %>% 
  mutate(date = ymd(date)) %>% 
  group_by(country_name) %>%
  arrange(date, .by_group = TRUE) %>% 
  summarise(across(starts_with("column"), ~paste(rev(.), collapse = ' '))) %>% 
  mutate(across(-country_name, parse_number))

 country_name column_1 column_2 column_3
  <chr>           <dbl>    <dbl>    <dbl>
1 UK                0.5        3     NA  
2 US               10          3      7.3

CodePudding user response:

You could na.omit and reverse each column and take first element. Then rbind. Take care of the right order and if it's as.Date formatted.

by(transform(dat, date=as.Date(date)), dat$country_name, \(x) {
  cbind(x[1, 1, drop=FALSE], 
        lapply(x[order(x$date), 3:5], \(z) {
          z <- el(rev(na.omit(z)))
          ifelse(length(z) == 1, z, NA_real_)
        }))
}) |> c(make.row.names=FALSE) |> do.call(what=rbind)
#   country_name column_1 column_2 column_3
# 1           UK      0.5        3       NA
# 2           US     10.0        3      7.3

Data:

dat <- structure(list(country_name = c("US", "US", "US", "US", "US", 
"US", "UK", "UK"), date = c("2016-11-02", "2017-09-12", "2017-09-19", 
"2020-02-10", "2021-03-10", "2021-05-02", "2016-11-02", "2017-09-12"
), column_1 = c(7.5, NA, NA, 10, NA, NA, NA, 0.5), column_2 = c(NA, 
NA, 8L, NA, NA, 3L, 2L, 3L), column_3 = c(NA, 9, 10, NA, 7.3, 
NA, NA, NA)), class = "data.frame", row.names = c(NA, -8L))
  • Related