Problem
I have a very large data frame with ~60 columns and ~12M rows that I want to prep for training a model.
To speed up the whole process, I would like to collapse the data frame by keeping only the unique rows along with their case weights (how often each row occurs).
So going from this data frame:
V1 | V2 | V3
------------
A | B | A
A | B | A
C | C | A
C | C | A
C | C | A
to this data frame:
V1 | V2 | V3 | weights
----------------------
A | B | A | 2
C | C | A | 3
Current Approach
Currently I use dplyr to do this, which works but takes forever:
library(dplyr)
df %>% group_by_all() %>% count()
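(Side note: group_by_all() is superseded in recent dplyr versions; the equivalent with the current API, which is not necessarily faster, would be something like this:)
library(dplyr)
df %>%
  group_by(across(everything())) %>%   # current replacement for group_by_all()
  count(name = "weights")              # count rows per group, name the column "weights"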
Is there a better and, above all, computationally faster way? Could this perhaps even be done while reading in the CSV rather than afterwards?
CodePudding user response:
Maybe your data is too big for the hardware, but have you tried data.table?
library(data.table)
setDTthreads(0L)                  # 0L = use all available threads
setDT(df)[, .N, by = names(df)]   # count occurrences of each unique row
# V1 V2 V3 N
# <char> <char> <char> <int>
# 1: A B A 2
# 2: C C A 3
Reproducible data:
df = data.frame(
  V1 = c("A", "A", "C", "C", "C"),
  V2 = c("B", "B", "C", "C", "C"),
  V3 = c("A", "A", "A", "A", "A")
)
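On the second part of the question: you cannot aggregate during parsing itself, but data.table::fread() reads a CSV multi-threaded and returns a data.table directly, so the collapse can follow immediately after the read. A minimal sketch, assuming the data sits in a file called train.csv (placeholder path):
library(data.table)
setDTthreads(0L)                      # use all available threads
df <- fread("train.csv")              # "train.csv" is a placeholder path
collapsed <- df[, .N, by = names(df)] # unique rows with their counts
setnames(collapsed, "N", "weights")   # optional: match the column name in the question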
CodePudding user response:
A base R option using aggregate:
> aggregate(n ~ ., cbind(df, n = 1), sum)
V1 V2 V3 n
1 A B A 2
2 C C A 3
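Note that the formula method of aggregate() uses na.action = na.omit by default, so rows containing NA are silently dropped, and aggregate() is single-threaded; on ~12M rows it will likely be considerably slower than the data.table approach above.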