Home > Software engineering >  Compute returns using lagged values on date when there are more than one row for the same date in R
Compute returns using lagged values on date when there are more than one row for the same date in R

Time:05-25

I have a dataset containing data for 6k asset and their marketprices.

I want to compute the daily returns, hence to apply the formula martketprice[i] - marketprice[i-1]/marketprice[i-1] The problem is that I have multiple observation for the same datetime, for example for asset x, I have 3 observation for the day t because it was traded by investor 1, 2 and 3. And so on so forth for every asset in the dataset. So my dataset can look like:

investor    asset    datetime      marketprice
1            x          t              10
2            x          t              10
3            x          t              10

My idea was to use something like

res <- res %>% 
  arrange(datetime) %>% 
  group_by(asset) %>% 
  mutate(ret = (marketprice - dplyr::lag(marketprice))/dplyr::lag(marketprice, default = NA)) %>% 
  ungroup()

but it doesn't work since, in the example above, for row 2 would mean use marketprice [i-1] which is the same day marketprice, while I want the previous day [t-1] to be used (not included in the example dataset)

Furthermore R should check that the [i-1] marketprice is not belonging to a date which is more than 4 days distant, hence if the date of row i is 10th of july, then the computation should apply only if the date [i-1] is 6th of july or closer.

Any idea?

CodePudding user response:

Based on the following assumptions I understand:

  1. When repeated day for the same asset, the marketprice is the same, not matter the investor.
  2. You don't mind which investor was (so we can remove rows)
  3. When day (t) is 5 days or ahead from previous (t-1), a NaN output is ok.

Libraries and some data example:

library(lubridate)
library(tidyverse)

# Data example

set.seed(132) # reproducibility

example = data.frame(
  investor = c(rep(1,3),2,3,rep(2,2),1,
               rep(2,4),rep(3,4)),
  asset = c(rep('A',8),
            rep('B',8)),
  datetime = c(today() c(1,2,3,3,3,4,5,6),
               today() c(1,seq(6,9),seq(16,18))),
  marketprice = c(10,20,30,30,30,sample(c(10,20,30),11,replace = TRUE))
)

Example dataset has 2 assets. First one (A) shows how the code deals with several rows for the same day. Second (B) shows how the code deals when there is a jump in dates greater than 4 days.

> example
   investor asset   datetime marketprice
1         1     A 2022-05-26          10
2         1     A 2022-05-27          20
3         1     A 2022-05-28          30
4         2     A 2022-05-28          30
5         3     A 2022-05-28          30
6         2     A 2022-05-29          30
7         2     A 2022-05-30          30
8         1     A 2022-05-31          30
9         2     B 2022-05-26          20
10        2     B 2022-05-31          10
11        2     B 2022-06-01          20
12        2     B 2022-06-02          10
13        3     B 2022-06-03          10
14        3     B 2022-06-10          30
15        3     B 2022-06-11          20
16        3     B 2022-06-12          10

Dplyr code:

# The formula is [price(t)-price(t-1)]/price(t-1) -> dif(price)/lag(price)
ret = example %>% 
  group_by(asset,datetime) %>% 
  slice(1) %>%  # remove repeated dates
  group_by(asset) %>% 
  arrange(datetime) %>% 
  mutate(ret = ifelse(datetime-lag(datetime) > 4,
                NA,
                (marketprice-lag(marketprice))/lag(marketprice))
         ) %>% # ifelse check the differences of days
  arrange(asset,datetime) # show by assets and dates

Output:

# A tibble: 14 x 5
# Groups:   asset [2]
   investor asset datetime   marketprice    ret
      <dbl> <chr> <date>           <dbl>  <dbl>
 1        1 A     2022-05-26          10 NA    
 2        1 A     2022-05-27          20  1    
 3        1 A     2022-05-28          30  0.5  
 4        2 A     2022-05-29          30  0    
 5        2 A     2022-05-30          30  0    
 6        1 A     2022-05-31          30  0    
 7        2 B     2022-05-26          20 NA    
 8        2 B     2022-05-31          10 NA    
 9        2 B     2022-06-01          20  1    
10        2 B     2022-06-02          10 -0.5  
11        3 B     2022-06-03          10  0    
12        3 B     2022-06-10          30 NA    
13        3 B     2022-06-11          20 -0.333
14        3 B     2022-06-12          10 -0.5 

2 rows dropped because a day had 3 entries of data.

  • Related