sum by rows with specific columns selected


In an R data.table I would like to compute a row-wise sum over selected columns. Example:

library(data.table)
iris = data.table(iris[,-5])
cols = c("Petal.Length","Petal.Width")

I did it like this, but I don't want to use the rowSums function:

    iris[, newSum := rowSums(.SD), by = .I, .SDcols = c("Petal.Length","Petal.Width")]

Does someone have a trick to easily sum the rows over the selected columns?

Thanks

CodePudding user response:

What's wrong with rowSums? It's the best way here. By the way, it can be written more concisely with base-R-style subsetting:

iris$newSum <- rowSums(iris[, c("Petal.Length", "Petal.Width")])

> iris
     Sepal.Length Sepal.Width Petal.Length Petal.Width newSum
  1:          5.1         3.5          1.4         0.2    1.6
  2:          4.9         3.0          1.4         0.2    1.6
  3:          4.7         3.2          1.3         0.2    1.5
  4:          4.6         3.1          1.5         0.2    1.7
  5:          5.0         3.6          1.4         0.2    1.6
 ---                                                         
146:          6.7         3.0          5.2         2.3    7.5
147:          6.3         2.5          5.0         1.9    6.9
148:          6.5         3.0          5.2         2.0    7.2
149:          6.2         3.4          5.4         2.3    7.7
150:          5.9         3.0          5.1         1.8    6.9

Or if you really dislike rowSums:

iris$newSum <- apply(iris[, c("Petal.Length", "Petal.Width")], 1, sum)
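If you would rather stay in data.table syntax, the same rowSums call also works in place with := and .SDcols, with no per-row by needed since rowSums is already vectorised over rows (this is the variant labelled "nimliug mod" in the benchmark answer below):

iris[, newSum := rowSums(.SD), .SDcols = c("Petal.Length", "Petal.Width")]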

CodePudding user response:

These don't use rowSums:

irisdt[, newSum := Reduce(`+`, .SD), .SDcols = cols]

irisdt[, newSum := as.matrix(.SD) %*% rep(1, ncol(.SD)), .SDcols = cols]

irisdt[, newSum := eval(parse(text = paste(cols, collapse = " + ")))]

irisdt[, newSum := apply(.SD, 1, sum), .SDcols = cols]

irisdt[, newSum := sum(.SD), by = 1:nrow(irisdt), .SDcols = cols]

irisdt[, newSum := c(rep(1, ncol(.SD)) %*% t(.SD)), .SDcols = cols]

library(purrr)
irisdt[, newSum := pmap_dbl(.SD, sum), .SDcols = cols]

irisdt[, newSum := do.call("mapply", c(sum, .SD)), .SDcols = cols]

irisdt[, newSum := tapply(as.matrix(.SD), row(.SD), sum), .SDcols = cols]

Note

library(data.table)
irisdt <- data.table(iris)    
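
As a quick sanity check, here is a minimal sketch (assuming the setup above and the cols vector from the question; ref is just a hypothetical helper name) confirming that an alternative matches rowSums:

# reference result from rowSums
ref <- rowSums(irisdt[, ..cols])

# one of the alternatives, computed in place
irisdt[, newSum := Reduce(`+`, .SD), .SDcols = cols]

# should print TRUE
all.equal(irisdt$newSum, ref)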

CodePudding user response:

This is not an answer in itself, just a comparison of those offered so far.

bench::mark(
  nimliug = iris[, newSum := rowSums(.SD), by = .I, .SDcols = c("Petal.Length","Petal.Width")],
  `nimliug mod` = iris[, newSum := rowSums(.SD), .SDcols = c("Petal.Length","Petal.Width")],
  `U12-Forward 1` = { iris$newSum <- rowSums(iris[, c("Petal.Length", "Petal.Width")]); iris; },
  `U12-Forward 2` = { iris$newSum <- apply(iris[, c("Petal.Length", "Petal.Width")], 1, sum); iris; },
  `G.G 1` = iris[, newSum := Reduce(`+`, .SD), .SDcols = cols],
  `G.G 2` = iris[, newSum := as.matrix(.SD) %*% rep(1, ncol(.SD)), .SDcols = cols],
  `G.G 3` = iris[, newSum := eval(parse(text = paste(cols, collapse = " + ")))],
  `G.G 4` = iris[, newSum := apply(.SD, 1, sum), .SDcols = cols],
  `G.G 5` = iris[, newSum := sum(.SD), by = 1:nrow(iris), .SDcols = cols], 
  `G.G 6` = iris[, newSum := c(rep(1, ncol(.SD)) %*% t(.SD)), .SDcols = cols],
  `G.G 7` = iris[, newSum := purrr::pmap_dbl(.SD, sum), .SDcols = cols], 
  `G.G 7 mod` = iris[, newSum := do.call(mapply, c(list(sum), .SD)), .SDcols = cols],
  min_iterations = 1000
)
# # A tibble: 12 x 13
#    expression         min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                     memory                  time             gc                  
#    <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                     <list>                  <list>           <list>              
#  1 nimliug        425.4us  541.5us     1662.    52.4KB     0     1000     0   601.83ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [9 x 3]>  <bch:tm [1,000]> <tibble [1,000 x 3]>
#  2 nimliug mod    387.2us  481.3us     1964.    52.4KB     0     1000     0   509.12ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [9 x 3]>  <bch:tm [1,000]> <tibble [1,000 x 3]>
#  3 U12-Forward 1  169.8us  221.2us     4050.    45.7KB     3.25  1248     1   308.14ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [14 x 3]> <bch:tm [1,249]> <tibble [1,249 x 3]>
#  4 U12-Forward 2    377us    503us     1837.    50.5KB     0     1000     0   544.43ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [18 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
#  5 G.G 1          320.6us  508.5us     1889.    66.2KB     1.89   999     1   528.86ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [10 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
#  6 G.G 2          360.1us  392.4us     2275.    52.4KB     0     1138     0   500.21ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [9 x 3]>  <bch:tm [1,138]> <tibble [1,138 x 3]>
#  7 G.G 3          373.7us  443.4us     2148.    34.3KB     0     1074     0   499.96ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [8 x 3]>  <bch:tm [1,074]> <tibble [1,074 x 3]>
#  8 G.G 4          540.3us  598.7us     1472.    57.3KB     1.47   999     1   678.56ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [13 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
#  9 G.G 5           4.99ms    5.5ms      177.    51.2KB     1.43   992     8      5.61s <data.table[,5] [150 x 5]> <Rprofmem[,3] [11 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 10 G.G 6          377.5us  492.2us     1991.      56KB     0     1000     0   502.26ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [11 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 11 G.G 7          707.7us  866.9us     1127.    66.2KB     1.13   999     1   886.81ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [10 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 12 G.G 7 mod      460.1us  586.1us     1669.    54.5KB     1.67   999     1   598.62ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [12 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>

Benchmarks can certainly be evil, especially when the data the benchmark uses is not representative of real data (either in class or in size/dimensions). Still, from this run it seems clear that rowSums by itself is the fastest (highest `itr/sec`) and close to the most memory-lean (low mem_alloc).
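
To address the representativeness point, here is a minimal sketch for re-running the comparison on larger data (the one-million-row size, the column names, and the choice of just two competitors are arbitrary assumptions):

library(data.table)
set.seed(42)
n <- 1e6  # arbitrary size; adjust toward your real data
big <- data.table(x = rnorm(n), y = rnorm(n), z = rnorm(n))
cols <- c("x", "y")

bench::mark(
  rowSums = big[, newSum := rowSums(.SD), .SDcols = cols],
  Reduce  = big[, newSum := Reduce(`+`, .SD), .SDcols = cols],
  min_iterations = 50
)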

Since they all produce the same output (bench::mark defaults to check = TRUE, which verifies that every expression returns the same result), I believe this is a reasonable comparison of their relative strengths. From here, which makes the most sense? Good code is not just about correct output; it is also about readability and maintainability, especially when your future self may not recall why some obscure, less-readable code was chosen over more direct, declarative code.
