Home > OS >  Extracting the Latest Available Year Data
Extracting the Latest Available Year Data

Time:03-18

I have a fairly large annual dataset in long format and with very large amount of missing values. I am trying to extract the data points for each column for the latest available year.

Input:

ID Year x y
1 2017 1 NA
1 2018 NA NA
1 2019 3 NA
1 2020 NA c
data.frame(ID=c(1,1,1,1),
           Year =c(2017, 2018, 2019, 2020),
           x=c(1, NA, 3, NA),
           y=c(NA, NA, NA, "c")
)

Output:

ID x y
1 3 c
data.frame(ID=1,
           x=3,
           y="c"
           )

Many thanks in advance for your help.

CodePudding user response:

You can try:

library(dplyr)
library(tidyr)    

dfx %>%
  pivot_longer(-c(ID, Year),
               values_transform = list(value = as.character)) %>% 
  na.omit() %>% 
  group_by(ID, name) %>% 
  filter(Year == max(Year)) %>% 
  dplyr::select(-Year) %>% 
  pivot_wider(values_from = value, names_from = name) 

# # A tibble: 1 x 3
# ID x     y    
# <dbl> <chr> <chr>
#   1     1 3     c  

CodePudding user response:

Assuming the rows are sorted by year within ID (which is the case in the question's example -- sort it first using arrange(ID, Year) if not), remove Year, group by ID, fill in each remaining column and take the last row of the group.

library(dplyr, exclude = c("lag", "filter"))
library(tidyr)

DF %>%
  select(-Year) %>%
  group_by(ID) %>%
  fill %>%
  slice_tail %>%
  ungroup

giving:

# A tibble: 1 x 3
     ID     x y    
  <dbl> <dbl> <chr>
1     1     3 c    

2) na.locf0 from zoo would also work and gives the same result.

library(dplyr, exclude = c("lag", "filter"))
library(zoo)

DF %>%
  select(-Year) %>%
  group_by(ID) %>%
  mutate(across(.fns = na.locf0)) %>%
  slice_tail %>%
  ungroup

CodePudding user response:

You may want to address a few things to your question to generate an appropriate response.

  • Separate the logic of the question into a Reprex. The question is a little unclear in how you want to get your output without selecting them manually.

  • Show/explain ways that you have attempted the problem as well so people don't waste their time or feel like you haven't tried. From what you have there i'll give it a go to see if anything helps you.

df <- data.frame(ID=c(1,1,1,1),
           Year =c(2017, 2018, 2019, 2020),
           x=c(1, NA, 3, NA),
           y=c(NA, NA, NA, "c")
)
# Remove year like in example?
df <- df %>% 
    select(., -Year) %>%
    filter(, !is.na(y))

# Get values you want?
> df.x <- df %>% 
    select(x) %>%
    na.omit() %>%
    as.double()
# Put together
df[2] <- df.x

  • Related