I've got a table like this:
country continent date n_case Ex TD TC
--------------------------------------------------------------------------------
Italy Europe 2022-02-24 6 NA 2 90
Italy Europe 2022-01-17 12 87 2 86
USA America 2022-02-23 NA NA 3 65
USA America 2022-01-08 6 NA 5 67
USA America 2022-01-04 6 7 7 87
etc etc...
What I wish is a new data frame with one row per country that for each country will store as columns the country name (column = country) and continent (column = continent), and also the latest date reported for each value in the column(date, n_case, Ex, TD, TC):
the desirable data frame:
country continent date n_case Ex TD TC
--------------------------------------------------------------------------------
Italy Europe 2022-02-24 6 87 2 90
USA America 2022-02-23 6 7 3 65
etc etc..
values to ignore are either NA or " " (blank)
thank you!
CodePudding user response:
With dplyr
, you can sort the data by dates decreasingly and then select the first non-NA value in each column.
library(dplyr)
df %>%
group_by(country, continent) %>%
arrange(desc(date), .by_group = TRUE) %>%
summarise(across(everything(), ~ .x[!is.na(.x)][1])) %>%
ungroup()
# # A tibble: 2 × 7
# country continent date n_case Ex TD TC
# <chr> <chr> <date> <int> <int> <int> <int>
# 1 Italy Europe 2022-02-24 6 87 2 90
# 2 USA America 2022-02-23 6 7 3 65
Data
df <- structure(list(country = c("Italy", "Italy", "USA", "USA", "USA"),
continent = c("Europe", "Europe", "America", "America", "America"),
date = structure(c(19047, 19009, 19046, 19000, 18996), class = "Date"),
n_case = c(6L, 12L, NA, 6L, 6L), Ex = c(NA, 87L, NA, NA, 7L),
TD = c(2L, 2L, 3L, 5L, 7L), TC = c(90L, 86L, 65L, 67L, 87L)),
row.names = c(NA, -5L), class = "data.frame")