Home > Net >  Use lagged columns in data.table in another column
Use lagged columns in data.table in another column

Time:08-21

I have the following problem:

dt = data.table(check=c("H", "L","L","L","H","L","L","L"),
           old = c(8, 3, 2.8, 2.7, 2.3, 2.3, 1.7, 1.4),
           new = c(19, 2.8, 2.7, 2.3, 21, 1.7, 1.4, 1.1),
           beg_low = rep(3.0, 8),
           beg_high = rep(8, 8),
           band_low = rep(0, 8),
           band_high = rep(0, 8))
> dt
   check old  new beg_low beg_high band_low band_high
1:     H 8.0 19.0       3        8        0         0
2:     L 3.0  2.8       3        8        0         0
3:     L 2.8  2.7       3        8        0         0
4:     L 2.7  2.3       3        8        0         0
5:     H 2.3 21.0       3        8        0         0
6:     L 2.3  1.7       3        8        0         0
7:     L 1.7  1.4       3        8        0         0
8:     L 1.4  1.1       3        8        0         0

Here we need to update the values of band_low and band_high in the following manner:

The fields band_low and band_high need to be filled using the values of the fields check, new, beg_low and beg_high.

band_low and band_high take their initial values from beg_low and beg_high.

If check is "H", then band_low would remain unchanged, i.e., it would be equal to beg_low. band_high, on the other hand, would be revised, using the field new. In the data we have, band_high would now become 19.

If check is "L", on the other hand, then band_high would be unchanged, and would retain its previous value. band_low would be revised, and would now become 2.8, in our data. Subsequent values of check are all "L", hence band_low keeps getting revised using new, but band_high remains unchanged.

So the answer would be:

dt.ans = data.table(check=c("H", "L","L","L","H","L","L","L"),
                old = c(8, 3, 2.8, 2.7, 2.3, 2.3, 1.7, 1.4),
                new = c(19, 2.8, 2.7, 2.3, 21, 1.7, 1.4, 1.1),
                beg_low = rep(3.0, 8),
                beg_high = rep(8, 8),
                band_low = c(3,2.8, 2.7, 2.3, 2.3, 1.7, 1.4, 1.1),
                band_high = c(19, 19, 19, 19, 21, 21, 21,21))

> dt.ans
   check old  new beg_low beg_high band_low band_high
1:     H 8.0 19.0       3        8      3.0        19
2:     L 3.0  2.8       3        8      2.8        19
3:     L 2.8  2.7       3        8      2.7        19
4:     L 2.7  2.3       3        8      2.3        19
5:     H 2.3 21.0       3        8      2.3        21
6:     L 2.3  1.7       3        8      1.7        21
7:     L 1.7  1.4       3        8      1.4        21
8:     L 1.4  1.1       3        8      1.1        21

I tried to solve this using data.table:

dt[, ':=' (band_low = beg_low,
           band_high = beg_high)]

dt[, band_low := ifelse(check=="L", new, shift(band_low)), 1:nrow(dt)]  
dt[, band_high := ifelse(check=="L", shift(band_high), new), 1:nrow(dt)]  

dt[, band_high := ifelse(check=="H", new, shift(band_high)), 1:nrow(dt)]  
dt[, band_low := ifelse(check=="H", shift(band_low), new), 1:nrow(dt)]  

> dt
   check old  new beg_low beg_high band_low band_high
1:     H 8.0 19.0       3        8       NA        19
2:     L 3.0  2.8       3        8      2.8        NA
3:     L 2.8  2.7       3        8      2.7        NA
4:     L 2.7  2.3       3        8      2.3        NA
5:     H 2.3 21.0       3        8       NA        21
6:     L 2.3  1.7       3        8      1.7        NA
7:     L 1.7  1.4       3        8      1.4        NA
8:     L 1.4  1.1       3        8      1.1        NA

Somehow the previous values of the variables are being missed, with the result being NAs.

What am I missing here please? Solving the problem using if clauses in a data.frame manner is possible, but very slow, and not scalable.

Any help would be greatly appreciated.

CodePudding user response:

You could use nafill with locf (last occurence carried forward) option :

dt[,band_high:=NA]
dt[,band_low:=NA]

dt[,band_high:=fifelse(check=='H',new,band_high)]
dt[,band_low:=fifelse(check=='L',new,band_low)]

dt[,band_high:=nafill(band_high,'locf')]
dt[,band_low:=nafill(band_low,'locf')]

dt[is.na(band_low),band_low:=beg_low]
dt[is.na(band_high),band_high:=beg_high]

all.equal(dt,dt.ans)

#[1] TRUE
  • Related