Home > Back-end >  R: How to repeatedly subtract specific columns from different series of columns, and output to a new
R: How to repeatedly subtract specific columns from different series of columns, and output to a new

Time:02-10

I have a dataframe in wide format, and I want to subtract specific columns from different series of columns. Ideally I'd like the results to be in a new dataframe.

For example: From this sample dataframe (dfOld), I would like columns A, B and C to each subtract D, and columns E, F and G to each subtract column H. In the real dataset, this keeps going and needs to be iterated.

image of dfOld as table

Sample Data:

dfOld <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), A = c(2, 3, 4,5,4,6,7,1,9,12), B = c(3, 4, 5,2,4,5,1,7,0,8), C = c(5, 6, 7,2,4,1,5,4,6,13), D = c(68, 7, 8,2,1,5,7,9,78,7), E = c(2, 3, 42,5,4,6,7,1,9,12), F = c(37, 4, 5,2,48,5,1,7,60,8), G = c(5, 6, 7,2,4,1,5,4,6,13), H = c(35, 7, 8,2,1,5,7,9,78,7))

The results would ideally be in a new dataframe, with columns that have values and names for A-D, B-D, C-D, E-H, F-H, G-H, and look like this:

image of dfNew as table

In Excel, the formula would be "=B2-$E2" dragged down the rows, and across 3 columns, and then repeated again for "F2-$I2" etc, using the "$" sign to lock the column

In R, I've only been able to do this manually, kind of like the answer previously posted for a similar question (Subtracting two columns to give a new column in R)

dfOld$A-D<-(dfOld$A-dfOld$D)
dfOld$B-D<-(dfOld$B-dfOld$D)
dfOld$C-D<-(dfOld$C-dfOld$D)
dfOld$E-H<-(dfOld$E-dfOld$H)
dfOld$F-H<-(dfOld$F-dfOld$H)
dfOld$G-H<-(dfOld$G-dfOld$H)

And then separated the new columns out into a new dataset.

However, this obviously isn't scalable for my much larger dataset, and I'd really like to learn how else to do this kind of operation that's so easy in Excel(although still not scalable for large datasets).

Part of the answer may already be here: Subtract a column in a dataframe from many columns in R But this answer (an several other similar ones) changes the values in the same dataframe, and the columns keep the same names. I haven't been able to adapt it so that the new values have new columns, with new names (and ideally in a new dataframe)

Another part of the answer may be here: Iterative function to subtract columns from a specific column in a dataframe and have the values appear in a new column These answers put the subtracted results in new columns with new names, but every column in this dataframe subtracts values of every other column (A,B,C,D,E,F,G,H each minus C). And I can't seem to adapt it so that it works over specific series of columns (A, B, C each minus D, then E, F, G each minus H, etc.)

Thanks in advance for your help.

CodePudding user response:

Probably others have better ways - but here is one possibility.

  1. load two libraries and set dfOld to data.table
library(data.table)
library(magrittr)
setDT(dfOld)
  1. get information about the columns, and make into a list.
lv = names(dfOld)[-1][seq(1,ncol(dfOld)-1)%%4>0]
lv = split(lv, ceiling(seq_along(lv)/3))
names(lv) = names(dfOld)[-1][seq(1,ncol(dfOld)-1)%%4==0]

lv looks like this:

> lv
$D
[1] "A" "B" "C"

$H
[1] "E" "F" "G"
  1. This is a bit convoluted, but basically, I'm taking each of the elements of the lv list, and I'm reshaping columns from dfOld, so I can do all subtractions at once. Then I'm retaining only the variables I need, and binding each of the resulting list of data.tables into a single datatable using rbindlist
res =rbindlist(lapply(names(lv), function(x)  {
  melt(dfOld,id=c("ID", x),measure.vars = lv[[x]]) %>% 
    .[,`:=`(nc=value-get(x),variable=paste0(variable,"-",x))] %>%
    .[,.(ID,variable,nc)]
}))
  1. Last step is simple - just dcast back
dcast(res,ID~variable, value.var="nc")

Output

    ID A-D B-D C-D E-H F-H G-H
 1:  1 -66 -65 -63 -33   2 -30
 2:  2  -4  -3  -1  -4  -3  -1
 3:  3  -4  -3  -1  34  -3  -1
 4:  4   3   0   0   3   0   0
 5:  5   3   3   3   3  47   3
 6:  6   1   0  -4   1   0  -4
 7:  7   0  -6  -2   0  -6  -2
 8:  8  -8  -2  -5  -8  -2  -5
 9:  9 -69 -78 -72 -69 -18 -72
10: 10   5   1   6   5   1   6

CodePudding user response:

First, I create a function to do the simple calculation, where we have the dataframe, then the column names as the inputs. Then, I use purrr map2 to pass the function (which I replicate for the number of times needed, which in this case is 6). Then, I provide the list of parameters to apply that function for each column pair. Then, I use invoke to apply the function and parameter. Now, we are left with a list of dataframes (as the output is an individual column with the ID). Then, I use reduce` to combine them back into one dataframe, then update the column names.

library(tidyverse)

subtract <- function(x, a, b){
  x %>% 
    mutate(!! a  := !!rlang::parse_expr(a) - !!rlang::parse_expr(b)) %>% 
    dplyr::select(ID, which(colnames(x)==a))
}

col_names <- c("ID", "A-D", "B-D", "C-D", "E-H", "F-H", "G-H")

map2(
  flatten(list(rep(list(
    subtract
  ), 6))),
  list(
    expression(a = "A", b = "D"),
    expression(a = "B", b = "D"),
    expression(a = "C", b = "D"),
    expression(a = "E", b = "H"),
    expression(a = "F", b = "H"),
    expression(a = "G", b = "H")
  ),
  ~ invoke(.x, c(list(dfOld), as.list(.y)))
) %>%
  reduce(left_join, by = "ID") %>% 
  set_names(col_names)

Output

   ID A-D B-D C-D E-H F-H G-H
1   1 -66 -65 -63 -33   2 -30
2   2  -4  -3  -1  -4  -3  -1
3   3  -4  -3  -1  34  -3  -1
4   4   3   0   0   3   0   0
5   5   3   3   3   3  47   3
6   6   1   0  -4   1   0  -4
7   7   0  -6  -2   0  -6  -2
8   8  -8  -2  -5  -8  -2  -5
9   9 -69 -78 -72 -69 -18 -72
10 10   5   1   6   5   1   6

Data

dfOld <- structure(
  list(
    ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
    A = c(2,
          3, 4, 5, 4, 6, 7, 1, 9, 12),
    B = c(3, 4, 5, 2, 4, 5, 1, 7, 0,
          8),
    C = c(5, 6, 7, 2, 4, 1, 5, 4, 6, 13),
    D = c(68, 7, 8, 2,
          1, 5, 7, 9, 78, 7),
    E = c(2, 3, 42, 5, 4, 6, 7, 1, 9, 12),
    F = c(37,
          4, 5, 2, 48, 5, 1, 7, 60, 8),
    G = c(5, 6, 7, 2, 4, 1, 5, 4, 6,
          13),
    H = c(35, 7, 8, 2, 1, 5, 7, 9, 78, 7)
  ),
  class = "data.frame",
  row.names = c(NA,-10L)
)
  • Related