How can I map values from one data.table to another, based on their distribution?
I created some sample data below.
In my small_data
, I have an overview with dates and how often they occur (expressed in count [N] and percentage [perc]). Next to that, I have a big_data
table, which untill now only consists of ID
values.
My goal is to map the distribution of dates from the small_data
to big_data
, such that 2.3% of my IDs will get date 2017-04-01
, 2.3% will get date 2019-01-01
and 2.6% will get date 2019-01-01
etc...
The percentages do not have to be this exact, if it's roughly the same, that is fine.
set.seed(123)
#create small data (source)
small_data <- data.table(
date = sample(seq(as.Date('2016-01-01'), as.Date('2019-12-31'), by="quarter"), replace=TRUE, size = 20),
N = sample(c(1:30), replace=TRUE, size=20)
)[order(N)]
#calculate percentage
small_data[, perc := 100* N/small_data[, sum(N)]]
> small_data
date N perc
1: 2017-04-01 8 2.318841
2: 2019-01-01 8 2.318841
3: 2019-01-01 9 2.608696
4: 2018-01-01 9 2.608696
5: 2016-07-01 10 2.898551
6: 2017-10-01 12 3.478261
7: 2018-01-01 13 3.768116
8: 2017-01-01 14 4.057971
9: 2018-10-01 14 4.057971
10: 2019-07-01 14 4.057971
11: 2016-10-01 14 4.057971
12: 2019-04-01 15 4.347826
13: 2016-10-01 20 5.797101
14: 2019-07-01 22 6.376812
15: 2018-10-01 24 6.956522
16: 2017-07-01 27 7.826087
17: 2016-01-01 27 7.826087
18: 2017-10-01 27 7.826087
19: 2019-10-01 29 8.405797
20: 2016-10-01 29 8.405797
#create big data (target)
big_data <- data.table(
ID = c(1:100)
)
> head(big_data)
ID
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
CodePudding user response:
You can do this quite simply with sample, as @Roland has pointed out in his comment. Here is a reproducible example using your data:
library(data.table)
## Data set up ----------------------------------------------------------------
set.seed(123)
small_data <- data.table(
date = sample(seq(as.Date('2016-01-01'), as.Date('2019-12-31'), by="quarter"), replace=TRUE, size = 20),
N = sample(c(1:30), replace=TRUE, size=20)
)[order(N)]
small_data[, perc := 100* N/small_data[, sum(N)]]
#create big data (target)
big_data <- data.table(
ID = c(1:100)
)
## Generate dates for big_data based on % occurence in small_data -------------
n_dates <- nrow(big_data)
### Use small_dates[["perc"]] as the probability vector in sample()
sampled_dates <-
sample(small_data[["date"]], size = n_dates,
prob = small_data[["perc"]], replace = TRUE)
big_data[["date"]] <- sampled_dates
head(big_data)
#> ID date
#> 1: 1 2016-07-01
#> 2: 2 2019-07-01
#> 3: 3 2019-07-01
#> 4: 4 2019-07-01
#> 5: 5 2016-07-01
#> 6: 6 2016-07-01
Created on 2022-12-14 with reprex v2.0.2