Home > database >  How to chain group_by, filter, distinct, count in data.table?
How to chain group_by, filter, distinct, count in data.table?

Time:10-14

I am somewhat new to data.table & trying to replicate my dplyr code to data.table but unable to get the same results.

libs

library(data.table)
library(lubridate)
library(tidyverse)

df

(Have not used any NA in this dummy data but need to filter out NA's)

test_df <- data.frame(id = c(1234, 1234, 5678, 5678),
           date = c("2021-10-10","2021-10-10", "2021-8-10", "2021-8-15")) %>% 
  
  mutate(date = ymd(date))

dplyr code:

Find out ids that have more than one distinct date.

test_df %>%
  group_by(id) %>%
  filter(!is.na(date)) %>% 
  distinct(date) %>% 
  count(id) %>% 
  filter(n > 1)
id      n

5678    2   

data.table attempt:

test_dt <- setDT(test_df)

test_dt[!is.na(date), by = id][
    ,keyby = .(date)][
      ,.N, by = id][
        N > 1
        ]

CodePudding user response:

The distinct in dplyr can be unique in data.table with by option

unique(setDT(test_df)[!is.na(date)], by = c("id", "date"))[, .N, by = id][N > 1]
     id N
1: 5678 2

Steps are as follows

  1. Convert to data.table (setDT)
  2. Remove the rows with NA from 'date' (!is.na(date))
  3. Get the unique rows by the 'id' and 'date' column
  4. Do a group by 'id' to get the count (.N)
  5. Finally, filter the rows where count is greater than 1
  • Related