In an R data.table, I would like to compute row-wise sums over selected columns. Example:
library(data.table)
iris = data.table(iris[,-5])
cols = c("Petal.Length","Petal.Width")
I did it like this, but I don't want to use the rowSums function:
iris[, newSum := rowSums(.SD), by = .I, .SDcols = c("Petal.Length","Petal.Width")]
Does anyone have a trick to easily sum the rows over just the selected columns?
Thanks
CodePudding user response:
What's wrong with rowSums? It's the best way here. By the way, it could be better with base R:
iris$newSum <- rowSums(iris[, c("Petal.Length", "Petal.Width")])
> iris
Sepal.Length Sepal.Width Petal.Length Petal.Width newSum
1: 5.1 3.5 1.4 0.2 1.6
2: 4.9 3.0 1.4 0.2 1.6
3: 4.7 3.2 1.3 0.2 1.5
4: 4.6 3.1 1.5 0.2 1.7
5: 5.0 3.6 1.4 0.2 1.6
---
146: 6.7 3.0 5.2 2.3 7.5
147: 6.3 2.5 5.0 1.9 6.9
148: 6.5 3.0 5.2 2.0 7.2
149: 6.2 3.4 5.4 2.3 7.7
150: 5.9 3.0 5.1 1.8 6.9
Or if you really dislike rowSums:
iris$newSum <- apply(iris[, c("Petal.Length", "Petal.Width")], 1, sum)
CodePudding user response:
These don't use rowSums:
irisdt[, newSum := Reduce(`+`, .SD), .SDcols = cols]
irisdt[, newSum := as.matrix(.SD) %*% rep(1, ncol(.SD)), .SDcols = cols]
irisdt[, newSum := eval(parse(text = paste(cols, collapse = "+")))]
irisdt[, newSum := apply(.SD, 1, sum), .SDcols = cols]
irisdt[, newSum := sum(.SD), by = 1:nrow(irisdt), .SDcols = cols]
irisdt[, newSum := c(rep(1, ncol(.SD)) %*% t(.SD)), .SDcols = cols]
library(purrr)
irisdt[, newSum := pmap_dbl(.SD, sum), .SDcols = cols]
irisdt[, newSum := do.call("mapply", c(sum, .SD)), .SDcols = cols]
irisdt[, newSum := tapply(as.matrix(.SD), row(.SD), sum), .SDcols = cols]
Note: the examples above assume the following setup, with cols defined as in the question.
library(data.table)
irisdt <- data.table(iris)
CodePudding user response:
This is not an answer in itself, just a comparison of those offered so far.
bench::mark(
nimliug = iris[, newSum := rowSums(.SD), by = .I, .SDcols = c("Petal.Length","Petal.Width")],
`nimliug mod` = iris[, newSum := rowSums(.SD), .SDcols = c("Petal.Length","Petal.Width")],
`U12-Forward 1` = { iris$newSum <- rowSums(iris[, c("Petal.Length", "Petal.Width")]); iris; },
`U12-Forward 2` = { iris$newSum <- apply(iris[, c("Petal.Length", "Petal.Width")], 1, sum); iris; },
`G.G 1` = iris[, newSum := Reduce(`+`, .SD), .SDcols = cols],
`G.G 2` = iris[, newSum := as.matrix(.SD) %*% rep(1, ncol(.SD)), .SDcols = cols],
`G.G 3` = iris[, newSum := eval(parse(text = paste(cols, collapse="+")))],
`G.G 4` = iris[, newSum := apply(.SD, 1, sum), .SDcols = cols],
`G.G 5` = iris[, newSum := sum(.SD), by = 1:nrow(iris), .SDcols = cols],
`G.G 6` = iris[, newSum := c(rep(1, ncol(.SD)) %*% t(.SD)), .SDcols = cols],
`G.G 7` = iris[, newSum := purrr::pmap_dbl(.SD, sum), .SDcols = cols],
`G.G 7 mod` = iris[, newSum := do.call(mapply, c(list(sum), .SD)), .SDcols = cols],
min_iterations = 1000
)
# # A tibble: 12 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 nimliug 425.4us 541.5us 1662. 52.4KB 0 1000 0 601.83ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [9 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 2 nimliug mod 387.2us 481.3us 1964. 52.4KB 0 1000 0 509.12ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [9 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 3 U12-Forward 1 169.8us 221.2us 4050. 45.7KB 3.25 1248 1 308.14ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [14 x 3]> <bch:tm [1,249]> <tibble [1,249 x 3]>
# 4 U12-Forward 2 377us 503us 1837. 50.5KB 0 1000 0 544.43ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [18 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 5 G.G 1 320.6us 508.5us 1889. 66.2KB 1.89 999 1 528.86ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [10 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 6 G.G 2 360.1us 392.4us 2275. 52.4KB 0 1138 0 500.21ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [9 x 3]> <bch:tm [1,138]> <tibble [1,138 x 3]>
# 7 G.G 3 373.7us 443.4us 2148. 34.3KB 0 1074 0 499.96ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [8 x 3]> <bch:tm [1,074]> <tibble [1,074 x 3]>
# 8 G.G 4 540.3us 598.7us 1472. 57.3KB 1.47 999 1 678.56ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [13 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 9 G.G 5 4.99ms 5.5ms 177. 51.2KB 1.43 992 8 5.61s <data.table[,5] [150 x 5]> <Rprofmem[,3] [11 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 10 G.G 6 377.5us 492.2us 1991. 56KB 0 1000 0 502.26ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [11 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 11 G.G 7 707.7us 866.9us 1127. 66.2KB 1.13 999 1 886.81ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [10 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 12 G.G 7 mod 460.1us 586.1us 1669. 54.5KB 1.67 999 1 598.62ms <data.table[,5] [150 x 5]> <Rprofmem[,3] [12 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
Benchmarks can certainly be evil, especially when the data the benchmark uses is not representative of real data (either in class or size/dimensions). However, from this it seems fairly clear that rowSums by itself (U12-Forward 1) is the fastest (highest `itr/sec`) and close to the most memory-lean (low mem_alloc).
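Since iris has only 150 rows, it may be worth repeating the comparison on data closer to your own. The following is only a rough sketch under stated assumptions: the table big, its columns a, b and c, and the row count n are all hypothetical, and system.time is used in place of bench::mark to keep it dependency-light. No timings are shown because they depend on the machine.

```r
library(data.table)

set.seed(1)    # reproducibility of the made-up data only
n <- 1e5       # assumed row count; adjust to resemble your real data

# Hypothetical table with three numeric columns; only a and b are summed.
big <- data.table(a = runif(n), b = runif(n), c = runif(n))
cols <- c("a", "b")

# Time three of the approaches from this thread on the larger table.
t_rowsums <- system.time(big[, s1 := rowSums(.SD), .SDcols = cols])
t_reduce  <- system.time(big[, s2 := Reduce(`+`, .SD), .SDcols = cols])
t_apply   <- system.time(big[, s3 := apply(.SD, 1, sum), .SDcols = cols])

# Elapsed seconds per approach (values vary by machine).
timings <- rbind(rowSums = t_rowsums, Reduce = t_reduce, apply = t_apply)
print(timings[, "elapsed", drop = FALSE])

# All three approaches should agree up to floating-point tolerance.
stopifnot(isTRUE(all.equal(big$s1, big$s2)),
          isTRUE(all.equal(big$s1, big$s3)))
```

A single system.time call is noisy, so treat the numbers as a rough indication rather than a replacement for bench::mark.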
Since they all produce the same output (bench::mark defaults to check=TRUE, which ensures that all outputs are the same), I believe this is a reasonable comparison of their strengths. From here, which makes the most sense? Code quality is not just about correct output; it's also about readability and maintainability, especially when your future self might not recall the context of why some obscure, less readable code was chosen over more direct and declarative code.
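To make the readability point concrete, here is a minimal sketch that puts the direct form next to one of the rowSums-free alternatives. It assumes a fresh copy of the built-in iris data; the names dt and newSum2 are introduced here purely for illustration.

```r
library(data.table)

# Assumption: a fresh copy of the built-in iris data, without Species.
dt <- as.data.table(datasets::iris[, -5])
cols <- c("Petal.Length", "Petal.Width")

# Direct and declarative: rowSums is already vectorised over rows,
# so no by = grouping is needed.
dt[, newSum := rowSums(.SD), .SDcols = cols]

# One rowSums-free alternative: add the selected columns pairwise.
dt[, newSum2 := Reduce(`+`, .SD), .SDcols = cols]

# Both columns agree up to floating-point tolerance.
stopifnot(isTRUE(all.equal(dt$newSum, dt$newSum2)))
```

Both lines give the same column; the first states the intent (a row sum) by name, which is the maintainability argument made above.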