How to track changes in rows/lines of data frame?

Time:06-18

Suppose I have the following dataframe df:

df=data.frame(ID=c(123100,123200,123300,123400,123500),"2014"=c(1,1,1,2,3),"2015"=c(1,1,1,2,3),"2016"=c(2,1,1,2,1), "2017"=c(2,1,1,2,1), "2018"=c(2,3,1,2,1) )

Now I want to find out, for each ID, whether the value changed and in which year. For example, ID 123100 changed from 1 to 2 in 2016. I would like to add new columns for change (1 = change, 0 = no change), year of change, old value (1, 2 or 3) and new value (1, 2 or 3).

In the end it should look like this:

df_final=data.frame(ID=c(123100,123200,123300,123400,123500),"2014"=c(1,1,1,2,3),"2015"=c(1,1,1,2,3),"2016"=c(2,1,1,2,1), "2017"=c(2,1,1,2,1), "2018"=c(2,3,1,2,1), "change"=c(1,1,0,0,1),
          "year"=c(2016, 2018, 0, 0, 2016), "before"=c(1,1,0,0,3), "after"=c(2, 3, 0, 0, 1))

I couldn't find any satisfying solution on here, so I hope you can help me.

CodePudding user response:

Here's a base R method.

It may be best to have the IDs with no change as NA. If you really want zeroes, just change c(NA, NA, NA) in the following code to c(0, 0, 0)

Note that in your example data frames, if you run the code as-is, the column names for each year all start with an "X", because data.frame makes the names syntactically valid by default. You can prevent this by adding the check.names = FALSE argument to the data.frame function.

cbind(df, setNames(as.data.frame(t(apply(df[-1], 1, function(x) {
  y <- which(diff(x) != 0) 
  if(length(y)) c(as.numeric(names(y)), x[y], x[y + 1])
  else c(NA, NA, NA)
}))), c("Year", "Before", "After")))
#>       ID 2014 2015 2016 2017 2018 Year Before After
#> 1 123100    1    1    2    2    2 2016      1     2
#> 2 123200    1    1    1    1    3 2018      1     3
#> 3 123300    1    1    1    1    1   NA     NA    NA
#> 4 123400    2    2    2    2    2   NA     NA    NA
#> 5 123500    3    3    1    1    1 2016      3     1
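
The question also asked for a change flag, which the output above does not include. A minimal sketch of one way to add it, assuming only the first change per ID matters (the helper name first_change and the lower-case column names are my own choices, not part of the answer above):

```r
# Data from the question; check.names = FALSE keeps "2014" etc. as column names
df <- data.frame(ID = c(123100, 123200, 123300, 123400, 123500),
                 "2014" = c(1, 1, 1, 2, 3), "2015" = c(1, 1, 1, 2, 3),
                 "2016" = c(2, 1, 1, 2, 1), "2017" = c(2, 1, 1, 2, 1),
                 "2018" = c(2, 3, 1, 2, 1), check.names = FALSE)

# Hypothetical helper: year, old value and new value of the FIRST change in a row
first_change <- function(x) {
  y <- which(diff(x) != 0)        # positions where consecutive years differ
  if (length(y)) {
    y <- y[1]                     # assumption: later changes are ignored
    c(as.numeric(names(y)), x[y], x[y + 1])
  } else {
    c(NA, NA, NA)                 # no change for this ID
  }
}

res <- as.data.frame(t(apply(df[-1], 1, first_change)))
names(res) <- c("year", "before", "after")

# 1 = a change was found, 0 = no change
res$change <- as.integer(!is.na(res$year))

out <- cbind(df, res[c("change", "year", "before", "after")])
out
```

Keeping only the first change also means every row returns exactly three values, so apply always gives back a matrix even if some ID changes more than once.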

Data used

df <- structure(list(ID = c(123100, 123200, 123300, 123400, 123500), 
    `2014` = c(1, 1, 1, 2, 3), `2015` = c(1, 1, 1, 2, 3), `2016` = c(2, 
    1, 1, 2, 1), `2017` = c(2, 1, 1, 2, 1), `2018` = c(2, 3, 
    1, 2, 1)), class = "data.frame", row.names = c(NA, -5L))

Created on 2022-06-18 by the reprex package (v2.0.1)
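
To illustrate the point about column names, here is a small sketch on a two-row subset of the question's data (the subset and the object names with_prefix and no_prefix are only for illustration):

```r
# Default behaviour: names are made syntactically valid, so years get an "X" prefix
with_prefix <- data.frame(ID = c(123100, 123200),
                          "2014" = c(1, 1), "2015" = c(1, 3))

# check.names = FALSE keeps the names exactly as given
no_prefix <- data.frame(ID = c(123100, 123200),
                        "2014" = c(1, 1), "2015" = c(1, 3),
                        check.names = FALSE)

names(with_prefix)  # "ID" "X2014" "X2015"
names(no_prefix)    # "ID" "2014" "2015"

# Non-syntactic names need backticks (or [[ ]]) when accessed with $
no_prefix$`2014`
```

With the unprefixed names, as.numeric(names(...)) works directly on the year columns, which is what the code in this answer relies on.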

CodePudding user response:

Here is an optional tidyverse approach:

library(tidyverse)

# join the per-ID summary back onto the original df
dplyr::left_join(df, 
                 # make df long to build groupings by ID
                 tidyr::pivot_longer(df, -ID) %>% 
                     dplyr::group_by(ID) %>% 
                     # order just to be sure
                     dplyr::arrange(ID, name) %>%
                     # generate year number, before and after values
                     dplyr::mutate(year = readr::parse_number(name),
                                   before = lag(value), 
                                   # if there is no after value use current value
                                   after = ifelse(is.na(lead(value)), value, lead(value))) %>%
                     # keep rows where the previous value differs from the current one
                     dplyr::filter(before != value) %>%
                     # unselect obsolete columns
                     dplyr::select(-name, -value),               
                 by = "ID") %>%
    # fill up empty fields with zeros
    dplyr::mutate(dplyr::across(year:after, ~ifelse(is.na(.x), 0, .x)))

      ID X2014 X2015 X2016 X2017 X2018 year before after
1 123100     1     1     2     2     2 2016      1     2
2 123200     1     1     1     1     3 2018      1     3
3 123300     1     1     1     1     1    0      0     0
4 123400     2     2     2     2     2    0      0     0
5 123500     3     3     1     1     1 2016      3     1

CodePudding user response:

matrixStats::rowDiffs might be helpful and faster here.

z <- apply(matrixStats::rowDiffs(as.matrix(df[-1])) != 0, 1, which.max) + 1; d <- dim(df)
m <- matrix(t(df[-1])[c(z + 0:(d[2] - 2)*d[1] - 1, z + 0:(d[2] - 2)*d[1])],,2, di=list(c(), c('before', 'after')))
cbind(df, `[<-`(cbind(change=1, year=substring(names(df[-1])[z], 2), m), z == 2, 1:4, 0))
#       ID X2014 X2015 X2016 X2017 X2018 change year before after
# 1 123100     1     1     2     2     2      1 2016      1     2
# 2 123200     1     1     1     1     3      1 2018      1     3
# 3 123300     1     1     1     1     1      0    0      0     0
# 4 123400     2     2     2     2     2      0    0      0     0
# 5 123500     3     3     1     1     1      1 2016      3     1
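
For readers without matrixStats, the first step can be sketched in base R. This is not the code from the answer above, only an illustration of the same idea (row-wise differences between neighbouring year columns, then the position of the first non-zero one); the names d, first and has_change are my own:

```r
# Data from the question; check.names = FALSE keeps the year names unprefixed
df <- data.frame(ID = c(123100, 123200, 123300, 123400, 123500),
                 "2014" = c(1, 1, 1, 2, 3), "2015" = c(1, 1, 1, 2, 3),
                 "2016" = c(2, 1, 1, 2, 1), "2017" = c(2, 1, 1, 2, 1),
                 "2018" = c(2, 3, 1, 2, 1), check.names = FALSE)

m <- as.matrix(df[-1])

# Row-wise lag-1 differences: each year minus the year before it
d <- m[, -1, drop = FALSE] - m[, -ncol(m), drop = FALSE]

# Position of the first non-zero difference in each row.
# which.max returns 1 when a row is all FALSE, so "no change" looks the same
# as "changed in the first interval" unless it is checked separately.
first <- apply(d != 0, 1, which.max)
has_change <- rowSums(d != 0) > 0

first
has_change
```

The separate has_change check avoids treating a genuine change between the first two years as "no change", which a test on the position alone cannot distinguish.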