Home > Blockchain >  In R, how can I group by one column and conditionally sum another?
In R, how can I group by one column and conditionally sum another?

Time:05-06

This is an add on to my previous question: How can I count a number of conditional rows within r dplyr mutate?

Let's say I have the data frame below. In my previous question I asked how I could calculate at each row how many subsequent times that row's customer ordered Product X (literally X, not the Product associated with the row), which is now given in nSubsqX. Now, I want to know the sum of the cost associated with those subsequent orders of X. I have manually entered the answer into nCostSubsqX below, but I don't understand how to do it programmatically.

   Date       Customer Product  cost nSubsqX nCostSubsqX
 1 2020-05-18 A        X           9       0           0
 2 2020-02-10 B        X           2       5          42
 3 2020-02-12 B        Y           3       5          42
 4 2020-03-04 B        Z           4       5          42
 5 2020-03-29 B        X           5       4          37
 6 2020-04-08 B        X           6       3          31
 7 2020-04-30 B        X           7       2          24
 8 2020-05-13 B        X           8       1           5
 9 2020-05-23 B        Y          10       1           5
10 2020-07-02 B        Y          11       1           5
11 2020-08-26 B        Y          12       1           5
12 2020-12-06 B        X          16       0           0
13 2020-01-31 C        X           1       3          42
14 2020-09-19 C        X          13       2          60
15 2020-10-13 C        X          14       1          15
16 2020-11-11 C        X          15       0           0
17 2020-12-26 C        Y          17       0           0

For the purpose of providing a Reprex, below is the code to create the data frame.

df = data.frame("Date" = as.Date(c("2020-01-31", "2020-02-10", "2020-02-12", 
"2020-03-04", "2020-03-29", "2020-04-08", "2020-04-30", "2020-05-13", "2020-05-18", 
"2020-05-23", "2020-07-02", "2020-08-26", "2020-09-19", "2020-10-13", "2020-11-11", 
"2020-12-06", "2020-12-26")), "Customer" = c("C","B","B","B","B","B","B","B","A",
"B","B","B","C","C","C","B","C"), "Product" = c("X","X","Y","Z","X","X","X","X","X",
"Y","Y","Y","X","X","X","X","Y"))

df$cost = seq(nrow(df))

Below is the code that gets me nSubsqX:

df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product=="X") - cumsum(Product=="X"))

Now I need to understand how to make the array the rows where Product is X, but from the cost column rather than from the Product column itself. Any thoughts?

Attempt 1, gives an error.

df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product=="X") - cumsum(Product=="X"),
    nCostSubsqX = sum(cost[which(Product == "X")]) - cumsum(cost[which(Product == "X")]))
...
Error in `mutate_cols()`:
  Problem with `mutate()` column `nCostSubsqX`.
  `nCostSubsqX = sum(cost[which(Product == "X")]) - ...`.
  `nCostSubsqX` must be size 11 or 1, not 6.
  The error occurred in group 2: Customer = "B".

Attempt 2, where the math isn't right. The nCostSubsqX column needs to have the cum cost to this point removed.

df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product=="X") - cumsum(Product=="X"),
    nCostSubsqX = zoo::na.locf0(replace(rep(NA_real_, n()), 
                                        Product == "X", rev(seq_len(sum(cost[which(Product == "X")]))))))
...
   Date       Customer Product  cost nSubsqX nCostSubsqX
 1 2020-05-18 A        X           9       0           9
 2 2020-02-10 B        X           2       5          44
 3 2020-02-12 B        Y           3       5          44
 4 2020-03-04 B        Z           4       5          44
 5 2020-03-29 B        X           5       4          43
 6 2020-04-08 B        X           6       3          42
 7 2020-04-30 B        X           7       2          41
 8 2020-05-13 B        X           8       1          40
 9 2020-05-23 B        Y          10       1          40
10 2020-07-02 B        Y          11       1          40
11 2020-08-26 B        Y          12       1          40
12 2020-12-06 B        X          16       0          39
13 2020-01-31 C        X           1       3          43
14 2020-09-19 C        X          13       2          42
15 2020-10-13 C        X          14       1          41
16 2020-11-11 C        X          15       0          40
17 2020-12-26 C        Y          17       0          40

Attempt 3, I don't know what the math here is doing, but it ain't right!

df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product=="X") - cumsum(Product=="X"),
    nCostSubsqX = zoo::na.locf0(replace(rep(NA_real_, n()), 
                       Product == "X", rev(seq_len(sum(cost[which(Product == "X")])))))-
                  zoo::na.locf0(ifelse(Product == "X",cumsum(cost[which(Product == "X")]),NA)))

CodePudding user response:

Attempt 1 was almost there. It's important that the number of rows is maintained. Replace cost[which(Product == "X")] with cost*(Product=="X") (a dirty trick). Btw. the which is unnecessary.

The snippet would be:

df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product=="X") - cumsum(Product=="X"),
    nCostSubsqX = sum(cost[Product == "X"]) - cumsum(cost*(Product == "X")))

CodePudding user response:

Here is a slightly different approach, if you are interested.

library(data.table)

f <- function(p,co=rep(1,length(p))) {
  sapply(seq_along(p), \(i) sum(co[-i:0][p[-i:0]=="X"]))
}

setDT(df)[
  order(Date,Customer),
  `:=`(nSubsqX = f(Product),nCostSubsqx=f(Product, cost)),
  by=Customer
]

In this approach, I actually use the same function f() for both nSubsqX, and for nCostSubsqx; the only difference is whether cost is additionally passed to f() as the co parameter, or the default co parameter is used.

Output:

          Date Customer Product  cost nSubsqX nCostSubsqx
        <Date>   <char>  <char> <int>   <num>       <int>
 1: 2020-01-31        C       X     1       3          42
 2: 2020-02-10        B       X     2       5          42
 3: 2020-02-12        B       Y     3       5          42
 4: 2020-03-04        B       Z     4       5          42
 5: 2020-03-29        B       X     5       4          37
 6: 2020-04-08        B       X     6       3          31
 7: 2020-04-30        B       X     7       2          24
 8: 2020-05-13        B       X     8       1          16
 9: 2020-05-18        A       X     9       0           0
10: 2020-05-23        B       Y    10       1          16
11: 2020-07-02        B       Y    11       1          16
12: 2020-08-26        B       Y    12       1          16
13: 2020-09-19        C       X    13       2          29
14: 2020-10-13        C       X    14       1          15
15: 2020-11-11        C       X    15       0           0
16: 2020-12-06        B       X    16       0           0
17: 2020-12-26        C       Y    17       0           0
  • Related