I have a fairly large annual dataset in long format and with very large amount of missing values. I am trying to extract the data points for each column for the latest available year.
Input:
ID | Year | x | y |
---|---|---|---|
1 | 2017 | 1 | NA |
1 | 2018 | NA | NA |
1 | 2019 | 3 | NA |
1 | 2020 | NA | c |
data.frame(ID=c(1,1,1,1),
Year =c(2017, 2018, 2019, 2020),
x=c(1, NA, 3, NA),
y=c(NA, NA, NA, "c")
)
Output:
ID | x | y |
---|---|---|
1 | 3 | c |
data.frame(ID=1,
x=3,
y="c"
)
Many thanks in advance for your help.
CodePudding user response:
You can try:
library(dplyr)
library(tidyr)
dfx %>%
pivot_longer(-c(ID, Year),
values_transform = list(value = as.character)) %>%
na.omit() %>%
group_by(ID, name) %>%
filter(Year == max(Year)) %>%
dplyr::select(-Year) %>%
pivot_wider(values_from = value, names_from = name)
# # A tibble: 1 x 3
# ID x y
# <dbl> <chr> <chr>
# 1 1 3 c
CodePudding user response:
Assuming the rows are sorted by year within ID (which is the case in the question's example -- sort it first using arrange(ID, Year)
if not), remove Year
, group by ID
, fill in each remaining column and take the last row of the group.
library(dplyr, exclude = c("lag", "filter"))
library(tidyr)
DF %>%
select(-Year) %>%
group_by(ID) %>%
fill %>%
slice_tail %>%
ungroup
giving:
# A tibble: 1 x 3
ID x y
<dbl> <dbl> <chr>
1 1 3 c
2) na.locf0
from zoo would also work and gives the same result.
library(dplyr, exclude = c("lag", "filter"))
library(zoo)
DF %>%
select(-Year) %>%
group_by(ID) %>%
mutate(across(.fns = na.locf0)) %>%
slice_tail %>%
ungroup
CodePudding user response:
You may want to address a few things to your question to generate an appropriate response.
Separate the logic of the question into a Reprex. The question is a little unclear in how you want to get your output without selecting them manually.
Show/explain ways that you have attempted the problem as well so people don't waste their time or feel like you haven't tried. From what you have there i'll give it a go to see if anything helps you.
df <- data.frame(ID=c(1,1,1,1),
Year =c(2017, 2018, 2019, 2020),
x=c(1, NA, 3, NA),
y=c(NA, NA, NA, "c")
)
# Remove year like in example?
df <- df %>%
select(., -Year) %>%
filter(, !is.na(y))
# Get values you want?
> df.x <- df %>%
select(x) %>%
na.omit() %>%
as.double()
# Put together
df[2] <- df.x