Extracting the Latest Available Year Data-CodePudding

I have a fairly large annual dataset in long format and with very large amount of missing values. I am trying to extract the data points for each column for the latest available year.

Input:

ID	Year	x	y
1	2017	1	NA
1	2018	NA	NA
1	2019	3	NA
1	2020	NA	c

data.frame(ID=c(1,1,1,1),
           Year =c(2017, 2018, 2019, 2020),
           x=c(1, NA, 3, NA),
           y=c(NA, NA, NA, "c")
)

Output:

ID	x	y
1	3	c

data.frame(ID=1,
           x=3,
           y="c"
           )

Many thanks in advance for your help.

CodePudding user response：

You can try:

library(dplyr)
library(tidyr)    

dfx %>%
  pivot_longer(-c(ID, Year),
               values_transform = list(value = as.character)) %>% 
  na.omit() %>% 
  group_by(ID, name) %>% 
  filter(Year == max(Year)) %>% 
  dplyr::select(-Year) %>% 
  pivot_wider(values_from = value, names_from = name) 

# # A tibble: 1 x 3
# ID x     y    
# <dbl> <chr> <chr>
#   1     1 3     c

CodePudding user response：

Assuming the rows are sorted by year within ID (which is the case in the question's example -- sort it first using arrange(ID, Year) if not), remove Year, group by ID, fill in each remaining column and take the last row of the group.

library(dplyr, exclude = c("lag", "filter"))
library(tidyr)

DF %>%
  select(-Year) %>%
  group_by(ID) %>%
  fill %>%
  slice_tail %>%
  ungroup

giving:

# A tibble: 1 x 3
     ID     x y    
  <dbl> <dbl> <chr>
1     1     3 c

2) na.locf0 from zoo would also work and gives the same result.

library(dplyr, exclude = c("lag", "filter"))
library(zoo)

DF %>%
  select(-Year) %>%
  group_by(ID) %>%
  mutate(across(.fns = na.locf0)) %>%
  slice_tail %>%
  ungroup

CodePudding user response：

You may want to address a few things to your question to generate an appropriate response.

Separate the logic of the question into a Reprex. The question is a little unclear in how you want to get your output without selecting them manually.
Show/explain ways that you have attempted the problem as well so people don't waste their time or feel like you haven't tried. From what you have there i'll give it a go to see if anything helps you.

df <- data.frame(ID=c(1,1,1,1),
           Year =c(2017, 2018, 2019, 2020),
           x=c(1, NA, 3, NA),
           y=c(NA, NA, NA, "c")
)
# Remove year like in example?
df <- df %>% 
    select(., -Year) %>%
    filter(, !is.na(y))

# Get values you want?
> df.x <- df %>% 
    select(x) %>%
    na.omit() %>%
    as.double()
# Put together
df[2] <- df.x