I have a large dataset with 40 variables for 20 countries where some columns have NAs in the first (and last) years. I would like to extract the first and last year for which there is no NA in a column for a given country.
The code below successfully gives the first and last year for which there is no NA in the GDP column for the UK.
min(sort(table$year[table$Country=="UK"][which(!is.na(table$GDP))]))
[1] "1959"
max(sort(table$year[table$Country=="UK"][which(!is.na(table$GDP))]))
[1] "2020"
However, this manual approach is not efficient. Therefore, I was wondering if this could be done for all countries and all variables using e.g. the tidyverse? I would like to achieve an output that shows the min and max year without NA'S for every country and every variable:
Country Variable min max
1 UK GDP 1959 2020
2 FR GDP 1980 2020
3 IT GDP 1980 2020
4 UK Inflation 1990 2022
5 FR Inflation 2000 2022
6 IT Inflation 2000 2022
Thank you!
CodePudding user response:
You can pivot your data longer, remove rows where the value is NA
, and take the min, max year:
library(dplyr)
library(tidyr)
pivot_longer(df, cols = -c(Country,year),names_to = "Variable") %>%
filter(!is.na(value)) %>%
group_by(Country, Variable) %>%
summarize(MinYear = min(year), MaxYear=max(year))
Note that this assumes that other than Country
and year
columns, all other columns are target variables of interest (i.e cols = -c(Country, year)
). If this is not the case, there are other options, such as:
- using a range of columns like this
cols = GDP:Inflation
- place your target cols in a vector
target_cols = c("GDP", "Inflation")
and usecols = all_of(target_vars)