Efficient way to de-duplicate a dataframe in R and create case weights


Problem

I have a very large data frame with ~60 columns and ~12M rows that I want to prep for training a model.

To speed up the whole process I would like to collapse the data frame down to its unique rows together with case weights (how often each row occurs).

So going from this data frame:

V1 | V2 | V3
------------
A  | B  | A
A  | B  | A
C  | C  | A
C  | C  | A
C  | C  | A

to this data frame:

V1 | V2 | V3 | weights
----------------------
A  | B  | A  | 2
C  | C  | A  | 3

Current Approach

Currently I use dplyr to do this, which works but takes forever:

library(dplyr)
df %>% group_by_all() %>% count()

Is there a better and, especially, computationally faster way? Perhaps even a way to do this while reading in the CSV rather than afterwards?

CodePudding user response:

Maybe your data is simply too big for your hardware, but have you tried data.table?

library(data.table)
setDTthreads(0L)                 # use all available threads
setDT(df)[, .N, by = names(df)]  # count rows (.N) per unique combination of all columns

#        V1     V2     V3     N
#    <char> <char> <char> <int>
# 1:      A      B      A     2
# 2:      C      C      A     3

Reproducible data:

df = data.frame(
  V1 = c("A", "A", "C", "C", "C"), 
  V2 = c("B", "B", "C", "C", "C"), 
  V3 = c("A", "A", "A", "A", "A")
)
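As for collapsing while reading in the CSV (mentioned in the question): below is a minimal sketch, assuming the data sits in a file at a hypothetical path data.csv, that reads it with the multithreaded fread() and collapses it right away, renaming the count column to weights to match the desired output.

library(data.table)
setDTthreads(0L)                       # use all available threads
dt <- fread("data.csv")                # hypothetical path; fread is multithreaded
collapsed <- dt[, .N, by = names(dt)]  # unique rows plus their counts
setnames(collapsed, "N", "weights")    # rename the count column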

CodePudding user response:

A base R option using aggregate

> aggregate(n ~ ., cbind(df, n = 1), sum)
  V1 V2 V3 n
1  A  B  A 2
2  C  C  A 3
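Once the counts exist, they can be passed as case weights to a model. A minimal sketch, assuming a hypothetical numeric response y (not part of the original data) and stats::lm as a stand-in for whatever model is actually being trained:

collapsed <- aggregate(n ~ ., cbind(df, n = 1), sum)
collapsed$y <- c(1, 0)  # hypothetical response values, purely for illustration
fit <- lm(y ~ V1, data = collapsed, weights = n)  # n supplies the case weights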