Home > Software design >  R For Loop and If-else data.table
R For Loop and If-else data.table

Time:10-28

I'm stuck on a for-loop that I'm trying to create. The example dataset is below:

ex <- structure(list(person_id = c("79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", 
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", 
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65"), prs_nat_key = c("8240588160001", 
"8240588160001", "8240588160001", "8240588160001", "106705689", 
"106705689", "106705689", "106705689"), serv_from_dt = structure(c(18262, 
18262, 18262, 18262, 18278, 18278, 18278, 18278), class = "Date"), 
    serv_to_dt = structure(c(18262, 18262, 18262, 18265, 18282, 
    18282, 18299, 18299), class = "Date"), new_pos = c("IP", 
    "IP", "IP", "IP", "IP", "IP", "IP", "IP"), days_diff = c(0, 
    0, 0, 3, 4, 4, 21, 21)), row.names = c(NA, -8L), class = c("data.table", 
"data.frame"))

I'm trying to create a new column, called start_date. This column would be created based off of the serv_from_dt and serv_to_dt dates of each person_id. The way I'm doing this so far is as follows:

find the unique serv_from_dt's by each person_id where the date differences between serv_from_dt and serv_to_dt are greater than 0 (let's just call this diff_date); if, by row, the serv_frm_dt is >= the person_id's MAX unique diff_date, and the serv_to_dt <= person_id's MAX unique diff_date, then label as that unique diff_date. I have this so far:

 values=ex[,.(uniqueN(sort(unique(serv_to_dt[ex$days_diff>0]), TRUE))), person_id]
    n = as.numeric(values[,1])
    m = as.numeric(values[,2])

for (i in m){
  ex[,`:=`(min_start = fifelse((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[1] & 
                             serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[1]), 
                           sort(unique(serv_from_dt[ex$days_diff>0]))[1], fifelse((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[i] & 
                                                                                     serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[i]), 
                                                                                  sort(unique(serv_from_dt[ex$days_diff>0]))[i], serv_from_dt)),
           max_end = fifelse((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[1] & 
                                  serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[1]), 
                               sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[1], fifelse((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[i] & 
                                                                                         serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[i]), 
                                                                                      sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[i], serv_from_dt))), prs_nat_key]
}

the above code gives me exactly what I want, but I don't know how to scale this for a larger dataset with multiple person_ids and multiple day_diffs. I would like the code to be such that if the serv_frm/serv_to_dts do not hold true of being between the max unique diff_date, to loop to the next unique diff_date. In this case, both person_id's only have 1 unique diff_date (so m = 1), but I would like to update the code to hold true in instances where m > 1. I've also tried doing it using base R but keep getting errors:

for(j in 1:m){

    
    ex[, min_start := if((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[j] & 
                          serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[j])) sort(unique(serv_from_dt[ex$days_diff>0]))[j]]
  j = j  1
  
}

any help would be greatly appreciated.

CodePudding user response:

My end goal was to create two new columns called min_start and max_end. I realized instead of doing ifelse statements, I could do a join. Here are my steps using a slightly larger example dataset:

ex <- structure(list(person_id = c("79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", 
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", 
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65"), prs_nat_key = c("8240588160001", 
"8240588160001", "8240588160001", "8240588160001", "8240588160001", 
"8240588160001", "8240588160001", "8240588160001", "106705689", 
"106705689", "106705689", "106705689"), serv_from_dt = structure(c(18262, 
18262, 18262, 18262, 18275, 18275, 18275, 18275, 18278, 18278, 
18278, 18278), class = "Date"), serv_to_dt = structure(c(18262, 
18262, 18262, 18265, 18275, 18278, 18278, 18278, 18282, 18282, 
18299, 18299), class = "Date"), new_pos = c("IP", "IP", "IP", 
"IP", "IP", "IP", "IP", "IP", "IP", "IP", "IP", "IP"), days_diff = c(0, 
0, 0, 3, 0, 3, 3, 3, 4, 4, 21, 21)), row.names = c(NA, -12L), class = c("data.table", 
"data.frame"))

create a new dataframe that has only the unique start/end dates for each person:

date_period <- ex[, .(unique_start = unique(serv_from_dt[days_diff>0]),
                      unique_end = unique(serv_to_dt[days_diff>0])), prs_nat_key][order(prs_nat_key,unique_start,-unique_end),]

date_period %<>% distinct(prs_nat_key, unique_start, .keep_all = TRUE) %>% setDT()

do a left join on this condition: if date_period$prs_nat_key = ex$prs_nat_key & ex$serv_from_dt >= date_period$unique_start & ex$serv_from_dt <= date_period$unique_end & ex$serv_to_dt >= date_period$unique_start & ex$serv_to_dt <= date_period$unique_end

ex[, c("start_date", "end_date") := 
             date_period[ex, # join
                 .(unique_start, unique_end),
                 on = .(unique_start < serv_from_dt,
                        unique_start < serv_to_dt,
                        unique_end > serv_to_dt,
                        unique_end > serv_from_dt,
                        prs_nat_key = prs_nat_key)]]

which I found from this question --> Conditional join in data.table?

CodePudding user response:

Not sure what your final outcome should be, but it looks overcomplicated. For example the date_period table you created could be done like this:

ex[, .(unique_start = first(serv_from_dt), unique_end = last(serv_to_dt)), by = c("prs_nat_key", "serv_from_dt")]

#      prs_nat_key serv_from_dt unique_start unique_end
# 1: 8240588160001   2020-01-01   2020-01-01 2020-01-04
# 2: 8240588160001   2020-01-14   2020-01-14 2020-01-17
# 3:     106705689   2020-01-17   2020-01-17 2020-02-07

As it seems that you try to rejoin it back to the original table than perhaps this is what you want. Yes this is all that is needed from the original table you posted.

ex[, `:=` (start_date = first(serv_from_dt), end_date = last(serv_to_dt)), by = c("prs_nat_key", "serv_from_dt")]

#                                person_id   prs_nat_key serv_from_dt serv_to_dt new_pos days_diff start_date   end_date
#  1: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-01 2020-01-01      IP         0 2020-01-01 2020-01-04
#  2: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-01 2020-01-01      IP         0 2020-01-01 2020-01-04
#  3: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-01 2020-01-01      IP         0 2020-01-01 2020-01-04
#  4: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-01 2020-01-04      IP         3 2020-01-01 2020-01-04
#  5: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-14 2020-01-14      IP         0 2020-01-14 2020-01-17
#  6: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-14 2020-01-17      IP         3 2020-01-14 2020-01-17
#  7: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-14 2020-01-17      IP         3 2020-01-14 2020-01-17
#  8: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-14 2020-01-17      IP         3 2020-01-14 2020-01-17
#  9: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65     106705689   2020-01-17 2020-01-21      IP         4 2020-01-17 2020-02-07
# 10: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65     106705689   2020-01-17 2020-01-21      IP         4 2020-01-17 2020-02-07
# 11: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65     106705689   2020-01-17 2020-02-07      IP        21 2020-01-17 2020-02-07
# 12: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65     106705689   2020-01-17 2020-02-07      IP        21 2020-01-17 2020-02-07
  • Related