counting rows between two specific rows with a condition-CodePudding

df <- structure(list(inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1"), ass = c("x", "x", "x", "x", "x"), datetime = c("2010-01-01", 
"2010-01-02", "2010-01-03", "2010-01-08", "2010-01-19"), portfolio = c(10, 
0, 5, 2, 0)), operation = c(10, -10, 5, -3, -2), class = "data.frame", row.names = c(NA, -5L))

So I have 4000 investors with 6000 different assets, for each investor I have his trading operations in two different variables: operation tells me if he is buying/selling; portfolio tells me how much he has in the portfolio.

What I want to do is computing the number of days a position stays open in the portfolio, so I though about computing the difference between the day in which the portfolio goes back to zero and the day in which the portfolio went positive (it is not possible to get negative portfolio).

so in the dataset above I would count row2 - row1 ==> 2010-01-02 - 2010-01-01 and row 5 - row 3 ==> 2010-01-19 - 2010-01-03 and so on...

I want to do this computation for all the investor & asset I have in my dataset for all the rows in which I find that portfolio > 0.

So my dataset will have a further column called duration which would be equal, in this case to c(0,1,0,5,16) (so of course i also had to compute raw1 - raw1 and raw3 - raw3) Hence my problem is to restart the count everytime portfolio goes back to zero.

CodePudding user response：

library(dplyr)

df %>% 
  mutate(datetime = as.Date(datetime, "%Y-%m-%d")) %>% 
  group_by(investor, asset) %>% 
  arrange(datetime) %>% 
  mutate(grp.pos = cumsum(lag(portfolio, default = 1) == 0)) %>%
  group_by(investor, asset, grp.pos) %>% 
  mutate(`Open (#days)` = datetime - datetime[1])

#> # A tibble: 5 x 6
#> # Groups:   investor, asset, grp.pos [2]
#>   investor asset datetime   portfolio grp.pos `Open (#days)`
#>   <chr>    <chr> <date>         <dbl>   <int> <drtn>        
#> 1 INV_1    x     2010-01-01        10       0  0 days       
#> 2 INV_1    x     2010-01-02         0       0  1 days       
#> 3 INV_1    x     2010-01-03         5       1  0 days       
#> 4 INV_1    x     2010-01-08         2       1  5 days       
#> 5 INV_1    x     2010-01-19         0       1 16 days

Data:

df <- structure(list(investor = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1"), 
                     asset = c("x", "x", "x", "x", "x"), 
                     datetime = c("2010-01-01", "2010-01-02", "2010-01-03", 
                                  "2010-01-08", "2010-01-19"), 
                     portfolio = c(10, 0, 5, 2, 0)), 
                     operation = c(10, -10, 5, -3, -2), 
                class = "data.frame", row.names = c(NA, -5L))

CodePudding user response：

Here is a way how we could do it, that is expandable if necessary for ass

First we group by inv to use for the original dataset. Then transform datetime to date format to do calculations easily (here we use ymd() function).

The next step could be done in different ways:

Main idea is to group the column portfolio indicated by the last row of the group that is 0. For this we arrange datetime in descending form to easily apply the grouping id with cumsum == 0.

After rearranging datetime we can calculate the last from the first as intended:

library(dplyr)
library(lubridate)

df %>% 
  group_by(inv) %>% 
  mutate(datetime = ymd(datetime)) %>% 
  arrange(desc(datetime)) %>% 
  group_by(position_Group = cumsum(portfolio==0)) %>% 
  arrange(datetime) %>% 
  mutate(position_open = last(datetime)-first(datetime)) %>% 
  ungroup()

  inv   ass   datetime   portfolio operation id_Group position_open
  <chr> <chr> <date>         <dbl>     <dbl>    <int> <drtn>       
1 INV_1 x     2010-01-01        10        10        2  1 days      
2 INV_1 x     2010-01-02         0       -10        2  1 days      
3 INV_1 x     2010-01-03         5         5        1 16 days      
4 INV_1 x     2010-01-08         2        -3        1 16 days      
5 INV_1 x     2010-01-19         0        -2        1 16 days