I have a data frame with millions of rows and tens of columns to which I need to apply a row-wise operation. My solution below works using dplyr, but I hope a switch to data.table will speed things up. Any help converting the code below to a data.table version would be appreciated.
library(tidyverse)
library(trend)

df = structure(list(
  id = 1:2,
  var = c(3L, 9L),
  col1_x = c("[(1,2,3)]", "[(100,90,80,70,60,50,40,30,20)]"),
  col2_x = c("[(2,4,6)]", "[(100,50,25,12,6,3,1,1,1)]")
), class = "data.frame", row.names = c(NA, -2L))
df = df %>%
  mutate(across(ends_with("x"), ~ gsub("[][()]", "", .)))

x_cols = df %>%
  select(ends_with("x")) %>%
  names()

df = df %>%
  rowwise() %>%
  mutate(across(all_of(x_cols),
                ~ ifelse(var <= 4, 0,
                         sens.slope(as.numeric(unlist(strsplit(., ','))))$estimates[[1]]))) %>%
  ungroup()
CodePudding user response:
While what @Ritchie Sacramento wrote is absolutely true, here's the information you asked for.
First, I want to start with set
or :=
. When you see the keyword set
(which can just be part of the function name) or the :=
symbol, you've told data.table
not to make copies of the data. Without declaring or declaration (that pesky =
or <-
), you've changed the data table. This is one of the key methods to prevent wasted memory with this package.
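As a quick illustration of := changing a table without any assignment (a throwaway sketch with a made-up table, not your data):

library(data.table)
dt <- data.table(a = 1:3)
# no = or <- anywhere: := adds column b to dt by reference,
# modifying the existing table rather than building a copy
dt[, b := a * 2]
dt   # print to see the new column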
Keep in mind that the environment pane in RStudio only updates when it registers one of those assignment operators (= or <-) creating something new. Since you changed the table in place, the environment pane may show stale information. You can use the refresh icon (top right of the pane), or you can print the object to the console to check. As soon as you declare anything that the pane picks up, everything in the pane is updated.
Change a data frame to a data.table. (Notice that keyword: set!) Both of the lines below do the same thing, but the second one copies everything in memory and makes it again. (Naming the result the same thing does not prevent the copy.)
setDT(df)
df <- data.table(df)
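A related point worth knowing: plain assignment of a data.table to another name does not copy it either, so use copy() when you truly want an independent object. A small sketch:

dt  <- data.table(a = 1:3)
dt2 <- dt          # not a copy; dt2 and dt are the same table underneath
dt2[, a := 0]      # so this changes dt as well
dt3 <- copy(dt)    # an actual copy; changes to dt3 leave dt alone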
I'm not going to start with your first code blurb. I'm starting with the name extraction.
You wrote:
x_cols = df %>%
  select(ends_with("x")) %>%
  names()
# [1] "col1_x" "col2_x"
There are many ways to get this information. This is what I did. Note that this doesn't really have anything to do with data.table; I just used base R here. You could use a data frame the same way.
xcols <- names(df)[endsWith(names(df), 'x')]
# [1] "col1_x" "col2_x"
I'm going to use this object, xcols, in the remaining examples. (Why keep reiterating the same declaration?)
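If you prefer a regular expression, an equivalent base R one-liner (just another option, nothing data.table-specific) is:

xcols <- grep("x$", names(df), value = TRUE)
# [1] "col1_x" "col2_x"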
You wrote the following to remove the brackets and parentheses.
df = df %>%
  mutate(across(ends_with("x"), ~ gsub("[][()]", "", .)))

#   id var                      col1_x                 col2_x
# 1  1   3                       1,2,3                  2,4,6
# 2  2   9 100,90,80,70,60,50,40,30,20 100,50,25,12,6,3,1,1,1
There are several ways you could do this, whether in a data frame or a data.table. Here are a couple of methods you can use with data.table. They do exactly the same thing as each other and as your code. Note the :=, which means the table was changed in place.
In the first example, I used .SD and .SDcols. These are data.table's column-selection tools. You use .SD in place of a column name when you want to work with more than one column, and .SDcols tells data.table which columns you're trying to use. Wrapping the left-hand side in parentheses, (xcols), where xcols is my variable holding the column names, tells data.table to replace the data in those columns rather than create a single new column literally named xcols.
The difference between the two versions is just how I used lapply, which doesn't have anything to do with data.table. If you need more info on that function, you can ask me, or you can look through the many Q&As out there already.
df[,
   (xcols) := lapply(.SD, function(k) gsub("[][()]", "", k)),
   .SDcols = xcols]

df[,
   (xcols) := lapply(.SD, gsub, pattern = "[][()]", replacement = ""),
   .SDcols = xcols]
Your last request was based on this code.
df %>%
  rowwise() %>%
  mutate(across(all_of(x_cols),
                ~ ifelse(var <= 5, 0,
                         sens.slope(as.numeric(unlist(strsplit(., ','))))$estimates[[1]]))) %>%
  ungroup()
Since you used var to delineate when to apply this, I've used the by argument (as in dplyr's group_by). In terms of the other requirements, you'll see .SD and lapply again.
df[,
   (xcols) := lapply(.SD, function(k) {
     ifelse(var <= 5, 0,
            sens.slope(as.numeric(strsplit(k, ",")[[1]]))$estimates[[1]])
   }),
   by = var, .SDcols = xcols]
If you think about how these differ, you may find that, in a lot of ways, they aren't all that different. For example, the dplyr equivalent of this last translation looks a lot like what I did with data.table.
df %>%
  group_by(var) %>%
  mutate(across(all_of(x_cols),
                ~ ifelse(var <= 5, 0,
                         sens.slope(as.numeric(unlist(strsplit(., ','))))$estimates[[1]])))
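If you want to convince yourself the translation is faithful, run the dplyr version on a separate copy of the original data and compare the two results. A quick sketch (dt_res and dp_res are hypothetical names for the data.table and dplyr results):

# drop the class difference (data.table vs tibble) before comparing
all.equal(as.data.frame(dt_res), as.data.frame(dp_res))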