Home > Blockchain >  Minimum value of a column based on NA's in other column by group
Minimum value of a column based on NA's in other column by group

Time:01-18

I have a large dataset with 40 variables for 20 countries where some columns have NAs in the first (and last) years. I would like to extract the first and last year for which there is no NA in a column for a given country.

The code below successfully gives the first and last year for which there is no NA in the GDP column for the UK.

min(sort(table$year[table$Country=="UK"][which(!is.na(table$GDP))]))
[1] "1959"
max(sort(table$year[table$Country=="UK"][which(!is.na(table$GDP))]))
[1] "2020"

However, this manual approach is not efficient. Therefore, I was wondering if this could be done for all countries and all variables using e.g. the tidyverse? I would like to achieve an output that shows the min and max year without NA'S for every country and every variable:

 Country  Variable  min  max
1      UK       GDP 1959 2020
2      FR       GDP 1980 2020
3      IT       GDP 1980 2020
4      UK Inflation 1990 2022
5      FR Inflation 2000 2022
6      IT Inflation 2000 2022

Thank you!

CodePudding user response:

You can pivot your data longer, remove rows where the value is NA, and take the min, max year:

library(dplyr)
library(tidyr)

pivot_longer(df, cols = -c(Country,year),names_to = "Variable") %>% 
  filter(!is.na(value)) %>% 
  group_by(Country, Variable) %>% 
  summarize(MinYear = min(year), MaxYear=max(year))

Note that this assumes that other than Country and year columns, all other columns are target variables of interest (i.e cols = -c(Country, year)). If this is not the case, there are other options, such as:

  • using a range of columns like thiscols = GDP:Inflation
  • place your target cols in a vector target_cols = c("GDP", "Inflation") and use cols = all_of(target_vars)
  • Related