Home > database >  How to do a new data frame of the latest value reported in each column?
How to do a new data frame of the latest value reported in each column?

Time:05-15

I've got a table like this:

country    continent      date         n_case         Ex         TD         TC
--------------------------------------------------------------------------------
 Italy      Europe      2022-02-24        6           NA          2          90
 Italy      Europe      2022-01-17        12          87          2          86
 USA        America     2022-02-23        NA          NA          3          65
 USA        America     2022-01-08        6           NA          5          67
 USA        America     2022-01-04        6           7           7          87
etc etc...

What I wish is a new data frame with one row per country that for each country will store as columns the country name (column = country) and continent (column = continent), and also the latest date reported for each value in the column(date, n_case, Ex, TD, TC):

the desirable data frame:

country    continent      date         n_case         Ex         TD         TC
--------------------------------------------------------------------------------
 Italy      Europe      2022-02-24        6           87          2          90
 USA        America     2022-02-23        6           7           3          65
 etc etc..

values to ignore are either NA or " " (blank)

thank you!

CodePudding user response:

With dplyr, you can sort the data by dates decreasingly and then select the first non-NA value in each column.

library(dplyr)

df %>%
  group_by(country, continent) %>%
  arrange(desc(date), .by_group = TRUE) %>% 
  summarise(across(everything(), ~ .x[!is.na(.x)][1])) %>%
  ungroup()

# # A tibble: 2 × 7
#   country continent date       n_case    Ex    TD    TC
#   <chr>   <chr>     <date>      <int> <int> <int> <int>
# 1 Italy   Europe    2022-02-24      6    87     2    90
# 2 USA     America   2022-02-23      6     7     3    65

Data
df <- structure(list(country = c("Italy", "Italy", "USA", "USA", "USA"),
continent = c("Europe", "Europe", "America", "America", "America"),
date = structure(c(19047, 19009, 19046, 19000, 18996), class = "Date"), 
n_case = c(6L, 12L, NA, 6L, 6L), Ex = c(NA, 87L, NA, NA, 7L),
TD = c(2L, 2L, 3L, 5L, 7L), TC = c(90L, 86L, 65L, 67L, 87L)),
row.names = c(NA, -5L), class = "data.frame")
  • Related