Home > Software engineering >  Map values based on distribution from one dataset to another in R (data.table)
Map values based on distribution from one dataset to another in R (data.table)

Time:12-14

How can I map values from one data.table to another, based on their distribution?

I created some sample data below. In my small_data, I have an overview with dates and how often they occur (expressed in count [N] and percentage [perc]). Next to that, I have a big_data table, which untill now only consists of ID values.

My goal is to map the distribution of dates from the small_data to big_data, such that 2.3% of my IDs will get date 2017-04-01, 2.3% will get date 2019-01-01 and 2.6% will get date 2019-01-01 etc... The percentages do not have to be this exact, if it's roughly the same, that is fine.

set.seed(123)

#create small data (source)
small_data <- data.table(
  date = sample(seq(as.Date('2016-01-01'), as.Date('2019-12-31'), by="quarter"), replace=TRUE, size = 20),
  N = sample(c(1:30), replace=TRUE, size=20)
)[order(N)]

#calculate percentage
small_data[, perc := 100* N/small_data[, sum(N)]]

> small_data
          date  N     perc
 1: 2017-04-01  8 2.318841
 2: 2019-01-01  8 2.318841
 3: 2019-01-01  9 2.608696
 4: 2018-01-01  9 2.608696
 5: 2016-07-01 10 2.898551
 6: 2017-10-01 12 3.478261
 7: 2018-01-01 13 3.768116
 8: 2017-01-01 14 4.057971
 9: 2018-10-01 14 4.057971
10: 2019-07-01 14 4.057971
11: 2016-10-01 14 4.057971
12: 2019-04-01 15 4.347826
13: 2016-10-01 20 5.797101
14: 2019-07-01 22 6.376812
15: 2018-10-01 24 6.956522
16: 2017-07-01 27 7.826087
17: 2016-01-01 27 7.826087
18: 2017-10-01 27 7.826087
19: 2019-10-01 29 8.405797
20: 2016-10-01 29 8.405797


#create big data (target)
big_data <-  data.table(
  ID = c(1:100)
)

> head(big_data)
   ID
1:  1
2:  2
3:  3
4:  4
5:  5
6:  6

CodePudding user response:

You can do this quite simply with sample, as @Roland has pointed out in his comment. Here is a reproducible example using your data:

library(data.table)
## Data set up ----------------------------------------------------------------
set.seed(123)
small_data <- data.table(
  date = sample(seq(as.Date('2016-01-01'), as.Date('2019-12-31'), by="quarter"), replace=TRUE, size = 20),
  N = sample(c(1:30), replace=TRUE, size=20)
)[order(N)]
small_data[, perc := 100* N/small_data[, sum(N)]]

#create big data (target)
big_data <-  data.table(
  ID = c(1:100)
)

## Generate dates for big_data based on % occurence in small_data -------------
n_dates <- nrow(big_data)

### Use small_dates[["perc"]] as the probability vector in sample()
sampled_dates <- 
  sample(small_data[["date"]], size = n_dates,
         prob = small_data[["perc"]], replace = TRUE)

big_data[["date"]] <- sampled_dates
head(big_data)
#>    ID       date
#> 1:  1 2016-07-01
#> 2:  2 2019-07-01
#> 3:  3 2019-07-01
#> 4:  4 2019-07-01
#> 5:  5 2016-07-01
#> 6:  6 2016-07-01

Created on 2022-12-14 with reprex v2.0.2

  • Related