I have a dataframe having country_name
, date
and several columns: column_1
, column_2
and column_3
. I am trying to extract the latest record based on date across several columns.
The dataframe looks like this:
| country_name | date | column_1| column_2| column_3|
| US | 2016-11-02 | 7.5 | NA | NA |
| US | 2017-09-12 | NA | NA | 9 |
| US | 2017-09-19 | NA | 8 | 10 |
| US | 2020-02-10 | 10 | NA | NA |
| US | 2021-03-10 | NA | NA | 7.3 |
| US | 2021-05-02 | NA | 3 | NA |
| UK | 2016-11-02 | NA | 2 | NA |
| UK | 2017-09-12 | 0.5 | 3 | NA |
.
.
For the US the desired output is:
| country_name | column_1| column_2| column_3|
| US | 10 | 3 | 7.3 |
For column_1
, the value with the latest date is 10 (date: 2020-02-10),
for column_2
is 3 (date: 2021-05-02), and for column_3
is 7.3 (date: 2021-03-10). My goal is to apply this logic across several countries. How do I achieve this?
CodePudding user response:
library(dplyr)
library(tidyr)
df1 %>%
mutate(date = as.Date(date)) %>%
group_by(country_name) %>%
arrange(date) %>%
select(-date) %>%
fill(everything()) %>%
slice(n())
#> # A tibble: 2 x 4
#> # Groups: country_name [2]
#> country_name column_1 column_2 column_3
#> <chr> <dbl> <int> <dbl>
#> 1 UK 0.5 3 NA
#> 2 US 10 3 7.3
Data:
read.table(text = "country_name date column_1 column_2 column_3
US 2016-11-02 7.5 NA NA
US 2017-09-12 NA NA 9
US 2017-09-19 NA 8 10
US 2020-02-10 10 NA NA
US 2021-03-10 NA NA 7.3
US 2021-05-02 NA 3 NA
UK 2016-11-02 NA 2 NA
UK 2017-09-12 0.5 3 NA",
header = T, stringsAsFactors = F) -> df1
CodePudding user response:
Here is how we could do it:
- If necessary transform
date
column to date class withymd()
function fromlubridate
. - group by
country_name
- Now comes the trick we use
across
for col1 col2... etc. and collapse in reverse withpaste(rev(.)....
to get the last value to first place. This is important for the next step. - Use
parse_number()
fromreadr
package that will extract the first number!
library(dplyr)
library(lubridate)
library(readr)
df %>%
mutate(date = ymd(date)) %>%
group_by(country_name) %>%
arrange(date, .by_group = TRUE) %>%
summarise(across(starts_with("column"), ~paste(rev(.), collapse = ' '))) %>%
mutate(across(-country_name, parse_number))
country_name column_1 column_2 column_3
<chr> <dbl> <dbl> <dbl>
1 UK 0.5 3 NA
2 US 10 3 7.3
CodePudding user response:
You could na.omit
and rev
erse each column and take first el
ement. Then rbind
. Take care of the right order
and if it's as.Date
formatted.
by(transform(dat, date=as.Date(date)), dat$country_name, \(x) {
cbind(x[1, 1, drop=FALSE],
lapply(x[order(x$date), 3:5], \(z) {
z <- el(rev(na.omit(z)))
ifelse(length(z) == 1, z, NA_real_)
}))
}) |> c(make.row.names=FALSE) |> do.call(what=rbind)
# country_name column_1 column_2 column_3
# 1 UK 0.5 3 NA
# 2 US 10.0 3 7.3
Data:
dat <- structure(list(country_name = c("US", "US", "US", "US", "US",
"US", "UK", "UK"), date = c("2016-11-02", "2017-09-12", "2017-09-19",
"2020-02-10", "2021-03-10", "2021-05-02", "2016-11-02", "2017-09-12"
), column_1 = c(7.5, NA, NA, 10, NA, NA, NA, 0.5), column_2 = c(NA,
NA, 8L, NA, NA, 3L, 2L, 3L), column_3 = c(NA, 9, 10, NA, 7.3,
NA, NA, NA)), class = "data.frame", row.names = c(NA, -8L))