I am somewhat new to data.table & trying to replicate my dplyr code to data.table but unable to get the same results.
libs
library(data.table)
library(lubridate)
library(tidyverse)
df
(Have not used any NA in this dummy data but need to filter out NA's)
test_df <- data.frame(id = c(1234, 1234, 5678, 5678),
date = c("2021-10-10","2021-10-10", "2021-8-10", "2021-8-15")) %>%
mutate(date = ymd(date))
dplyr code:
Find out ids that have more than one distinct date.
test_df %>%
group_by(id) %>%
filter(!is.na(date)) %>%
distinct(date) %>%
count(id) %>%
filter(n > 1)
id n
5678 2
data.table attempt:
test_dt <- setDT(test_df)
test_dt[!is.na(date), by = id][
,keyby = .(date)][
,.N, by = id][
N > 1
]
CodePudding user response:
The distinct
in dplyr
can be unique
in data.table
with by
option
unique(setDT(test_df)[!is.na(date)], by = c("id", "date"))[, .N, by = id][N > 1]
id N
1: 5678 2
Steps are as follows
- Convert to data.table (
setDT
) - Remove the rows with
NA
from 'date' (!is.na(date)
) - Get the
unique
rowsby
the 'id' and 'date' column - Do a group by 'id' to get the count (
.N
) - Finally, filter the rows where count is greater than 1