I have this dataframe
df <- structure(list(inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2" "INV_2"),
ass = c("x", "x", "x", "y" "y", "x", "x", "x", "t", "t", "t"),
datetime = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-08", "2010-01-19", "2010-02-20", "2010-02-22", "2010-02-23", "2010-03-01", "2010-03-02", "2010-03-04"),
portfolio = c(10, 0, 2, 2, 0, 5, 5, 5, 3, 0, 2),
G = (1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
class = "data.frame", row.names = c(NA, -5L))
which represents investor transactions in financial markets, so I have 4k different investors IDs and 6k different assets.
What I'm searching is a way to cumsum the variable G
for each investor*asset
combination.
In particular I want that the cumsum()
restart whenever that specific investor*asset
combination is paired with a portfolio == 0
.
So in the dataframe above I should get a new column called posdays
which should be equal to:
posdays = (1, 1, 0, 0, 0, 1, 2, 3, 1, 1, 1)
where the first 3 entries refers to INV_1*X
(notice the count restart in the third row because in the previous the portfolio == 0
), the fourth and fifth to INV_1*Y
then
INV_2*X
which cumsum the G variable for 3 times since the portfolio > 0
, and the last three refers to INV_2*T
where again the count restarts after the second entry since the portfolio == 0
I've tried something myself but I wasn't able to get what I'm looking for. My code is:
res <- res %>%
group_by(group = cumsum(dplyr::lag(portfolio == 0, default = 0))) %>%
mutate(posdays = cumsum(G)) %>%
select(-group) %>%
ungroup
but in this way I'm not able to differenciate for investor and asset as I want. So basically I think I'm looking for a way to add a specification of investor*asset group_by in the previous code. But I have no idea of how since I have a low experience as an R user
Any idea?
CodePudding user response:
For anyone interested i've tried with this approach, not sure enough if it works.
res <- res %>%
group_by(investor, asset) %>%
mutate(group = cumsum(dplyr::lag(portfolio == 0, default = 0))) %>%
group_by(investor, asset, group) %>%
mutate(posdays = cumsum(G)) %>%
select(-group) %>%
ungroup
CodePudding user response:
Some minor corrections to your original dataframe:
df <- structure(
list(
inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2"),
ass = c("x", "x", "x", "y", "y", "x", "x", "x", "t", "t", "t"),
datetime = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-08", "2010-01-19", "2010-02-20", "2010-02-22", "2010-02-23", "2010-03-01", "2010-03-02", "2010-03-04"),
G = c(1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
portfolio = c(10, 0, 2, 2, 0, 5, 5, 5, 3, 0, 2)
),
class = "data.frame", row.names = c(NA, -11L)
)
It looks like you've got the right idea with your own code. The trick is to create and group by a new column. Don't forget to ensure your data are correctly ordered before performing cumsum.
library(dplyr)
df_new <- df |>
arrange(inv, ass, datetime) |>
group_by(inv, ass) |>
mutate(
restart = lag(portfolio == 0, default = FALSE),
group = cumsum(restart)
) |>
group_by(inv, ass, group) |>
mutate(pos_days = cumsum(G))