Suppose I have the following dataframe df:
df=data.frame(ID=c(123100,123200,123300,123400,123500),"2014"=c(1,1,1,2,3),"2015"=c(1,1,1,2,3),"2016"=c(2,1,1,2,1), "2017"=c(2,1,1,2,1), "2018"=c(2,3,1,2,1) )
Now I want to find out for which ID the data changed, and in which year. For example, in 2016 the value for ID 123100 changed from 1 to 2. I would like to add new columns for change (1 = change, 0 = no change), year of change, old value (1, 2 or 3) and new value (1, 2 or 3).
In the end it should look like this:
df_final=data.frame(ID=c(123100,123200,123300,123400,123500),"2014"=c(1,1,1,2,3),"2015"=c(1,1,1,2,3),"2016"=c(2,1,1,2,1), "2017"=c(2,1,1,2,1), "2018"=c(2,3,1,2,1), "change"=c(1,1,0,0,1),
"year"=c(2016, 2018, 0, 0, 2016), "before"=c(1,1,0,0,3), "after"=c(2, 3, 0, 0, 1))
I couldn't find any satisfying solution on here, so I hope you can help me.
CodePudding user response:
Here's a base R method.
It may be best to have the IDs with no change as NA. If you really want zeroes, just change c(NA, NA, NA) in the following code to c(0, 0, 0).
Note that in your example data frames, if you run the code as-is, the column names for each year all start with an "X" - you can prevent this by adding the check.names = FALSE argument to the data.frame function.
cbind(df, setNames(as.data.frame(t(apply(df[-1], 1, function(x) {
  # index of the change; diff() names it after the later year
  y <- which(diff(x) != 0)
  if(length(y)) c(as.numeric(names(y)), x[y], x[y + 1])
  else c(NA, NA, NA)
}))), c("Year", "Before", "After")))
#> ID 2014 2015 2016 2017 2018 Year Before After
#> 1 123100 1 1 2 2 2 2016 1 2
#> 2 123200 1 1 1 1 3 2018 1 3
#> 3 123300 1 1 1 1 1 NA NA NA
#> 4 123400 2 2 2 2 2 NA NA NA
#> 5 123500 3 3 1 1 1 2016 3 1
Data used
df <- structure(list(ID = c(123100, 123200, 123300, 123400, 123500),
`2014` = c(1, 1, 1, 2, 3), `2015` = c(1, 1, 1, 2, 3), `2016` = c(2,
1, 1, 2, 1), `2017` = c(2, 1, 1, 2, 1), `2018` = c(2, 3,
1, 2, 1)), class = "data.frame", row.names = c(NA, -5L))
Created on 2022-06-18 by the reprex package (v2.0.1)
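To make the base R approach above easier to follow, here is a minimal sketch of what diff() and which() do for a single row. It assumes the question's data built with check.names = FALSE; the variable names x, d and y are only for illustration.

```r
# Example data from the question; check.names = FALSE keeps "2014" instead of "X2014"
df <- data.frame(ID = c(123100, 123200, 123300, 123400, 123500),
                 "2014" = c(1, 1, 1, 2, 3), "2015" = c(1, 1, 1, 2, 3),
                 "2016" = c(2, 1, 1, 2, 1), "2017" = c(2, 1, 1, 2, 1),
                 "2018" = c(2, 3, 1, 2, 1), check.names = FALSE)

# One row (ID 123100) as a named numeric vector
x <- unlist(df[1, -1])

# diff() keeps the names of the later elements, so a non-zero entry
# is labelled with the year in which the new value first appears
d <- diff(x)
y <- which(d != 0)

names(y)   # year of the change
x[y]       # value before the change
x[y + 1]   # value after the change
```

For this row the change is labelled "2016", with 1 before and 2 after, which matches the first row of the output above.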
CodePudding user response:
Here is an optional tidyverse approach:
library(tidyverse)
# join the per-ID change summary back onto the original df
dplyr::left_join(df,
  # make df long to build groupings by ID
  tidyr::pivot_longer(df, -ID) %>%
    dplyr::group_by(ID) %>%
    # order just to be sure
    dplyr::arrange(ID, name) %>%
    # generate year number, before and after values
    dplyr::mutate(year = readr::parse_number(name),
                  before = lag(value),
                  # the value after the change is the current row's value
                  after = value) %>%
    # keep rows where the previous value differs from the current one
    dplyr::filter(before != value) %>%
    # drop the columns that are no longer needed
    dplyr::select(-name, -value),
  by = "ID") %>%
  # fill up empty fields with zeros
  dplyr::mutate(dplyr::across(year:after, ~ifelse(is.na(.x), 0, .x)))
ID X2014 X2015 X2016 X2017 X2018 year before after
1 123100 1 1 2 2 2 2016 1 2
2 123200 1 1 1 1 3 2018 1 3
3 123300 1 1 1 1 1 0 0 0
4 123400 2 2 2 2 2 0 0 0
5 123500 3 3 1 1 1 2016 3 1
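The outputs of the first two answers do not include the change column the question asks for. A minimal sketch of deriving it, assuming a zero-filled result like the one printed above (the data frame res is typed in by hand here purely for illustration):

```r
# Hypothetical result with zero-filled year/before/after columns,
# copied from the printed output for illustration only
res <- data.frame(ID     = c(123100, 123200, 123300, 123400, 123500),
                  year   = c(2016, 2018, 0, 0, 2016),
                  before = c(1, 1, 0, 0, 3),
                  after  = c(2, 3, 0, 0, 1))

# 1 where a change year was found, 0 otherwise
res$change <- as.integer(res$year != 0)
res
```

If the no-change rows hold NA instead of 0, as in the base R answer, `as.integer(!is.na(res$year))` would play the same role.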
CodePudding user response:
matrixStats::rowDiffs might be helpful and faster here.
z <- apply(matrixStats::rowDiffs(as.matrix(df[-1])) != 0, 1, which.max) + 1; d <- dim(df)
m <- matrix(t(df[-1])[c(z + 0:(d[2] - 2)*d[1] - 1, z + 0:(d[2] - 2)*d[1])],,2, di=list(c(), c('before', 'after')))
cbind(df, `[<-`(cbind(change=1, year=substring(names(df[-1])[z], 2), m), z == 2, 1:4, 0))
# ID X2014 X2015 X2016 X2017 X2018 change year before after
# 1 123100 1 1 2 2 2 1 2016 1 2
# 2 123200 1 1 1 1 3 1 2018 1 3
# 3 123300 1 1 1 1 1 0 0 0 0
# 4 123400 2 2 2 2 2 0 0 0 0
# 5 123500 3 3 1 1 1 1 2016 3 1
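One thing to watch with the z == 2 test: which.max returns 1 for a row with no TRUE at all, so z is 2 both for "no change" and for a change in the second year column. Below is a hedged base R sketch (not the answer's own method) that keeps an explicit change flag instead. The small data frame and the names changed, has_change and res are made up for illustration.

```r
# Hypothetical data: ID 1 changes in 2015, ID 2 never changes, ID 3 changes in 2016
df <- data.frame(ID = c(1, 2, 3),
                 "2014" = c(1, 1, 1), "2015" = c(2, 1, 1), "2016" = c(2, 1, 3),
                 check.names = FALSE)
m <- as.matrix(df[-1])

# TRUE where a value differs from the one in the previous year column
changed <- m[, -1, drop = FALSE] != m[, -ncol(m), drop = FALSE]

# Explicit flag, independent of where the first change occurs
has_change <- rowSums(changed) > 0

# Column index (within m) of the first new value;
# which.max returns 1 for all-FALSE rows, so z alone is ambiguous
z <- apply(changed, 1, which.max) + 1
i <- seq_len(nrow(m))

res <- data.frame(change = as.integer(has_change),
                  year   = ifelse(has_change, as.numeric(colnames(m)[z]), 0),
                  before = ifelse(has_change, m[cbind(i, z - 1)], 0),
                  after  = ifelse(has_change, m[cbind(i, z)], 0))
cbind(df, res)
```

Here ID 1 is reported as changed in 2015 and ID 2 as unchanged, although both have z equal to 2.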