Cumsum on investor and asset combination with a condition to restart in R-CodePudding

I have this dataframe

df <- structure(list(inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2" "INV_2"), 
ass = c("x", "x", "x", "y" "y", "x", "x", "x", "t", "t", "t"), 
datetime = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-08", "2010-01-19", "2010-02-20", "2010-02-22", "2010-02-23", "2010-03-01", "2010-03-02", "2010-03-04"), 
portfolio = c(10, 0, 2, 2, 0, 5, 5, 5, 3, 0, 2), 
G = (1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1), 
class = "data.frame", row.names = c(NA, -5L))

which represents investor transactions in financial markets, so I have 4k different investors IDs and 6k different assets. What I'm searching is a way to cumsum the variable G for each investor*asset combination. In particular I want that the cumsum() restart whenever that specific investor*asset combination is paired with a portfolio == 0.

So in the dataframe above I should get a new column called posdays which should be equal to:

posdays = (1, 1, 0, 0, 0, 1, 2, 3, 1, 1, 1)

where the first 3 entries refers to INV_1*X (notice the count restart in the third row because in the previous the portfolio == 0), the fourth and fifth to INV_1*Y then INV_2*X which cumsum the G variable for 3 times since the portfolio > 0, and the last three refers to INV_2*T where again the count restarts after the second entry since the portfolio == 0

I've tried something myself but I wasn't able to get what I'm looking for. My code is:

res <- res %>%
  group_by(group = cumsum(dplyr::lag(portfolio == 0, default = 0))) %>%
  mutate(posdays = cumsum(G)) %>%
  select(-group) %>% 
  ungroup

but in this way I'm not able to differenciate for investor and asset as I want. So basically I think I'm looking for a way to add a specification of investor*asset group_by in the previous code. But I have no idea of how since I have a low experience as an R user

Any idea?

CodePudding user response：

For anyone interested i've tried with this approach, not sure enough if it works.

res <- res %>%
  group_by(investor, asset) %>% 
  mutate(group = cumsum(dplyr::lag(portfolio == 0, default = 0))) %>%
  group_by(investor, asset, group) %>% 
  mutate(posdays = cumsum(G)) %>%
  select(-group) %>% 
  ungroup

CodePudding user response：

Some minor corrections to your original dataframe:

df <- structure(
  list(
    inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2"),
    ass = c("x", "x", "x", "y", "y", "x", "x", "x", "t", "t", "t"),
    datetime = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-08", "2010-01-19", "2010-02-20", "2010-02-22", "2010-02-23", "2010-03-01", "2010-03-02", "2010-03-04"),
    G = c(1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
    portfolio = c(10, 0, 2, 2, 0, 5, 5, 5, 3, 0, 2)
  ),
  class = "data.frame", row.names = c(NA, -11L)
)

It looks like you've got the right idea with your own code. The trick is to create and group by a new column. Don't forget to ensure your data are correctly ordered before performing cumsum.

library(dplyr)

df_new <- df |> 
  arrange(inv, ass, datetime) |> 
  group_by(inv, ass) |> 
  mutate(
    restart = lag(portfolio == 0, default = FALSE),
    group = cumsum(restart)
  ) |> 
  group_by(inv, ass, group) |> 
  mutate(pos_days = cumsum(G))