Home > OS >  data table: for each row generate random values from other table
data table: for each row generate random values from other table

Time:05-23

I would like to simulate flight numbers by generating flight numbers from a simulation table (table_simul), by taking an observation table as a basis.

table_simul <- data.table(
  date_f = c("2020-01-01","2020-01-02","2020-01-03","2020-01-03"),
  city = c("Paris","Paris","London", "Berlin")
)

table_obs <- data.table(
  city = c("Paris","Paris","Paris","London","London", "Berlin"),
  flight = c(1,2,7,4,5,14),
  weight = c(0.33,0.33,0.33,0.5,0.5,1)
)

Used data:

 Table simul:
    
date        city

2020-01-01  Paris
2020-01-01  Paris
2020-01-01  London
2020-01-01  Berlin


Table obs
---------
city   flight   weight

Paris   1        0.33
Paris   2        0.33
Paris   7        0.33
London  4        0.5
London  5        0.5
Berlin  14       1

Expected result:

date_f        city     flight
2020-01-01  Paris       2
2020-01-02  Paris       2
2020-01-03  London      4
2020-01-03  Berlin      14

I would like to use the data table package because the volume of data is very large. Here's what I tried but didn't work:

get_flight_sample <- function(param_city){
  table_simul[city==param_city]
  res <- sample(table_obs$flight,1, replace=T, prob = table_obs$weight)
}

res <- table_simul[,.(flight = get_flight_sample(city))]

CodePudding user response:

Maybe use a helper function to sample from table_obs and then join with table_simul

f <- function(i, ...) {
  if(length(i) == 1) i else sample(i, size = 1, ...)
}

set.seed(42)
tmp <- table_obs[, .(flight = f(flight, prob = weight)), by = city]
table_simul[, flight := tmp[table_simul, on = .(city)]$flight]
table_simul
#       date_f   city flight
#1: 2020-01-01  Paris      1
#2: 2020-01-02  Paris      1
#3: 2020-01-03 London      4
#4: 2020-01-03 Berlin     14

CodePudding user response:

Would something like this help?

simul_df <- data.frame()
sapply(c(1:1000), FUN = function(datos) {
  date_f = as.Date(sample(seq(as.Date('2021/01/01'), as.Date('2022/12/31'), by="day"), 1))
  city = sample(table_obs$city, size = 1)
  flight = sample(table_obs$flight, size = 1)
  weight = sample(table_obs$weight, size = 1)
  simul_df <<- as.data.frame(rbind(simul_df, cbind(date_f, city, flight, weight)))
})

At the end you get something like this:

   date_f   city flight weight
1   19342 London      4    0.5
2   19081  Paris      7    0.5
3   18916 London      4    0.5
4   19338  Paris      5    0.5
5   19313  Paris      4   0.33
6   18731  Paris      1   0.33
7   18877 London      5   0.33
8   18864  Paris      7    0.5
9   19126  Paris      5   0.33
10  19086  Paris      2   0.33

You will get a data frame that you can convert to data.table afterwards. Best

  • Related