Problem
I have a very large data frame with ~60 columns and ~12M rows that I want to prep for training a model.
To speed up the whole process, I would like to collapse the data frame by keeping only the unique rows along with their case weights (how often each row occurs).
So going from this data frame:
V1 | V2 | V3
------------
A | B | A
A | B | A
C | C | A
C | C | A
C | C | A
to this data frame:
V1 | V2 | V3 | weights
----------------------
A | B | A | 2
C | C | A | 3
Current Approach
Currently I use dplyr to do this, which works but takes forever:
library(dplyr)
df %>% group_by_all() %>% count()
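(Side note: group_by_all() is superseded in recent dplyr versions; the equivalent with the current API, which is not necessarily faster, would be something like this:)
library(dplyr)
df %>%
  group_by(across(everything())) %>%   # current replacement for group_by_all()
  count(name = "weights")              # count rows per group, name the column "weights"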
Is there a better and, above all, computationally faster way? Could this perhaps even be done while reading in the CSV rather than afterwards?
CodePudding user response:
Maybe your data is too big for the hardware, but have you tried data.table?
library(data.table)
setDTthreads(0L)                  # 0L = use all available threads
setDT(df)[, .N, by = names(df)]   # count occurrences of each unique row
# V1 V2 V3 N
# <char> <char> <char> <int>
# 1: A B A 2
# 2: C C A 3
Reproducible data:
df = data.frame(
  V1 = c("A", "A", "C", "C", "C"),
  V2 = c("B", "B", "C", "C", "C"),
  V3 = c("A", "A", "A", "A", "A")
)
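On the second part of the question: you cannot aggregate during parsing itself, but data.table::fread() reads a CSV multi-threaded and returns a data.table directly, so the collapse can follow immediately after the read. A minimal sketch, assuming the data sits in a file called train.csv (placeholder path):
library(data.table)
setDTthreads(0L)                      # use all available threads
df <- fread("train.csv")              # "train.csv" is a placeholder path
collapsed <- df[, .N, by = names(df)] # unique rows with their counts
setnames(collapsed, "N", "weights")   # optional: match the column name in the question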
CodePudding user response:
A base R option using aggregate:
> aggregate(n ~ ., cbind(df, n = 1), sum)
V1 V2 V3 n
1 A B A 2
2 C C A 3
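Note that the formula method of aggregate() uses na.action = na.omit by default, so rows containing NA are silently dropped, and aggregate() is single-threaded; on ~12M rows it will likely be considerably slower than the data.table approach above.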