df <- structure(list(inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1"), ass = c("x", "x", "x", "x", "x"), datetime = c("2010-01-01",
"2010-01-02", "2010-01-03", "2010-01-08", "2010-01-19"), portfolio = c(10,
0, 5, 2, 0)), operation = c(10, -10, 5, -3, -2), class = "data.frame", row.names = c(NA, -5L))
So I have 4000
investors with 6000
different assets, for each investor I have his trading operations in two different variables: operation tells me if he is buying/selling; portfolio tells me how much he has in the portfolio.
What I want to do is computing the number of days a position stays open in the portfolio, so I though about computing the difference between the day in which the portfolio goes back to zero and the day in which the portfolio went positive (it is not possible to get negative portfolio).
so in the dataset above I would count row2 - row1 ==> 2010-01-02 - 2010-01-01
and row 5 - row 3 ==> 2010-01-19 - 2010-01-03
and so on...
I want to do this computation for all the investor & asset I have in my dataset for all the rows in which I find that portfolio > 0
.
So my dataset will have a further column called duration
which would be equal, in this case to c(0,1,0,5,16)
(so of course i also had to compute raw1 - raw1
and raw3 - raw3
)
Hence my problem is to restart the count everytime portfolio
goes back to zero.
CodePudding user response:
library(dplyr)
df %>%
mutate(datetime = as.Date(datetime, "%Y-%m-%d")) %>%
group_by(investor, asset) %>%
arrange(datetime) %>%
mutate(grp.pos = cumsum(lag(portfolio, default = 1) == 0)) %>%
group_by(investor, asset, grp.pos) %>%
mutate(`Open (#days)` = datetime - datetime[1])
#> # A tibble: 5 x 6
#> # Groups: investor, asset, grp.pos [2]
#> investor asset datetime portfolio grp.pos `Open (#days)`
#> <chr> <chr> <date> <dbl> <int> <drtn>
#> 1 INV_1 x 2010-01-01 10 0 0 days
#> 2 INV_1 x 2010-01-02 0 0 1 days
#> 3 INV_1 x 2010-01-03 5 1 0 days
#> 4 INV_1 x 2010-01-08 2 1 5 days
#> 5 INV_1 x 2010-01-19 0 1 16 days
Data:
df <- structure(list(investor = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1"),
asset = c("x", "x", "x", "x", "x"),
datetime = c("2010-01-01", "2010-01-02", "2010-01-03",
"2010-01-08", "2010-01-19"),
portfolio = c(10, 0, 5, 2, 0)),
operation = c(10, -10, 5, -3, -2),
class = "data.frame", row.names = c(NA, -5L))
CodePudding user response:
Here is a way how we could do it, that is expandable if necessary for ass
First we group by inv
to use for the original dataset. Then transform datetime
to date format to do calculations easily (here we use ymd()
function).
The next step could be done in different ways:
Main idea is to group the column portfolio
indicated by the last row of the group that is 0. For this we arrange datetime
in descending form to easily apply the grouping id with cumsum == 0
.
After rearranging datetime
we can calculate the last from the first as intended:
library(dplyr)
library(lubridate)
df %>%
group_by(inv) %>%
mutate(datetime = ymd(datetime)) %>%
arrange(desc(datetime)) %>%
group_by(position_Group = cumsum(portfolio==0)) %>%
arrange(datetime) %>%
mutate(position_open = last(datetime)-first(datetime)) %>%
ungroup()
inv ass datetime portfolio operation id_Group position_open
<chr> <chr> <date> <dbl> <dbl> <int> <drtn>
1 INV_1 x 2010-01-01 10 10 2 1 days
2 INV_1 x 2010-01-02 0 -10 2 1 days
3 INV_1 x 2010-01-03 5 5 1 16 days
4 INV_1 x 2010-01-08 2 -3 1 16 days
5 INV_1 x 2010-01-19 0 -2 1 16 days