Home > Enterprise >  Rolling subtraction across columns (not rows!) in R
Rolling subtraction across columns (not rows!) in R

Time:12-09

I've looked around for a similar question but all of the ones I've found wanted to do a rolling subtraction by rows.

What I want to do, is a rolling subtraction across my dataframe columns. Specifically, I would like to subtract each column sequentially (from left to right), while keeping the current subtracted cumulative, as the "total" column to subtract from in the next sequence.

I've found a way to hard-code this, but obviously it looks ugly and the code would break if the number of columns differ in any way from the number of dfs created.

Let's say we have a dataframe of a population for each age for each year, with the total being the rowSums of each year:

df <- data.frame(Age <- c(1:40), 
                 Total <- rep(500,40), 
                 Y1990 <- rep(100,40), 
                 Y1991 <- rep(100,40),
                 Y1992 <- rep(100,40))

The result I want was achieved by the following code:

df1 <- df$Total  #or df[2]
df2 <- df1 - df[3]
df3 <- df2 - df[4]
...
dfx <- df(x-1) - df[x 1]

#and then we join them together like so:
final_df <- cbind(df$Age, df1, df2, df3,..., dfx)

#final_df should be the Age column, the Total column (500), df2 should be 400 (500-100 = 400), df3 should be 300, etc. etc.)

I fiddled around with loops but couldn't quite get the first/last iteration to work (the x 1/x-1 portion kept giving me an error that subscript was out of range). I even tried using "break" or "next" in the loop but I couldn't quite grasp it. I have about 70 years of data, and possibly more in the future, so I need to update my code to make it future-proof so as to not have hundreds of lines of "dfx" code.

I'm wondering if anybody could provide a super simple loop or function to go about this. Perhaps a data.table solution is easiest, though I have a hard time with data.table syntax. Bonus points if you can keep the variable name throughout the iteration (not necessary though). I just want my code to be pretty and robust! Cheers and thank you.

CodePudding user response:

I think this is what you want. There is no need for 40 identical rows, 5 should be enough:

df <- data.frame(Age = c(1:5), Total = rep(500, 5), Y1990 = rep(100, 5), Y1991 = rep(100, 5), Y1992 = rep(100, 5))

final_df <- data.frame(df[, 1:2], df$Total - t(apply(df[, 3:5], 1, cumsum)))
colnames(final_df)[-(1:2)] <- c("df2", "df3", "df4")
final_df
#   Age Total df2 df3 df4
# 1   1   500 400 300 200
# 2   2   500 400 300 200
# 3   3   500 400 300 200
# 4   4   500 400 300 200
# 5   5   500 400 300 200

CodePudding user response:

Here's a solution with data.table:

library(data.table)
df <- data.frame(Age = c(1:5), Total = rep(500, 5), Y1990 = rep(100, 5), Y1991 = rep(100, 5), Y1992 = rep(100, 5))
setDT(df)
final_df <- cbind(df[, .(Age = Age)], 
                  df[, Reduce(`-`, .SD, init = Total, accumulate = TRUE), 
                     .SDcols = Y1990:Y1992])
final_df
  Age  V1  V2  V3  V4
1:   1 500 400 300 200
2:   2 500 400 300 200
3:   3 500 400 300 200
4:   4 500 400 300 200
5:   5 500 400 300 200

CodePudding user response:

Various ways to go about this:

cbind(df[1], matrixStats::rowCumsums(as.matrix(df[-1])))
  Age   1   2   3   4
1   1 500 600 700 800
2   2 500 600 700 800
3   3 500 600 700 800
4   4 500 600 700 800
5   5 500 600 700 800


cbind(df[1], list2DF(Reduce('-', df[-1], accumulate = TRUE)))

  Age Var.2 Var.3 Var.4 Var.5
1   1   500   400   300   200
2   2   500   400   300   200
3   3   500   400   300   200
4   4   500   400   300   200
5   5   500   400   300   200
  • Related