How can I count a number of conditional rows within r dplyr mutate?-CodePudding

I want to take a set of data, group it by one column, order it by another, and then count how many subsequent instances of a certain event happen. For example, in the data below... I want to add a column, call it nSubsqX, that tells me at each row, for that customer, how many subsequent orders have Product "X". Row 1 should result in 3 because rows 13:15 all are Customer C, Product X; Row 9 should result in 0 since there are no subsequent orders by Customer A.

         Date Customer Product
1  2020-01-31        C       X
2  2020-02-10        B       X
3  2020-02-12        B       Y
4  2020-03-04        B       Z
5  2020-03-29        B       X
6  2020-04-08        B       X
7  2020-04-30        B       X
8  2020-05-13        B       X
9  2020-05-18        A       X
10 2020-05-23        B       Y
11 2020-07-02        B       Y
12 2020-08-26        B       Y
13 2020-09-19        C       X
14 2020-10-13        C       X
15 2020-11-11        C       X
16 2020-12-06        B       X
17 2020-12-26        C       Y

For the purpose of providing a Reprex, below is the code to create the data frame.

df = data.frame("Date" = as.Date(c("2020-01-31", "2020-02-10", "2020-02-12", 
"2020-03-04", "2020-03-29", "2020-04-08", "2020-04-30", "2020-05-13", "2020-05-18", 
"2020-05-23", "2020-07-02", "2020-08-26", "2020-09-19", "2020-10-13", "2020-11-11", 
"2020-12-06", "2020-12-26")), "Customer" = c("C","B","B","B","B","B","B","B","A",
"B","B","B","C","C","C","B","C"), "Product" = c("X","X","Y","Z","X","X","X","X","X",
"Y","Y","Y","X","X","X","X","Y"))

I expect that I will need some sort of mutate function, but I can't quite get it right, I've tried:

df2 = df %>%
  group_by(Customer) %>%
  arrange(Customer, Date) %>%
  mutate(
    nSubsqX = length(Customer[which(Product == "X")]))

Which gives the TOTAL of times "X" shows up, but what I want is the number subsequently to this one. I also tried:

df2 = df %>%
  group_by(Customer) %>%
  arrange(Customer, Date) %>%
  mutate(
    nSubsqX = length(Customer[which(Product == "X" & Date > Date)]))

Which just returns 0 probably because Date > Date doesn't make any sense. I need a way to say Date > THIS Date. The solution I am trying to achieve is shown below:

   Date       Customer Product nSubsqX
 1 2020-05-18 A        X             0
 2 2020-02-10 B        X             5
 3 2020-02-12 B        Y             5
 4 2020-03-04 B        Z             5
 5 2020-03-29 B        X             4
 6 2020-04-08 B        X             3
 7 2020-04-30 B        X             2
 8 2020-05-13 B        X             1
 9 2020-05-23 B        Y             1
10 2020-07-02 B        Y             1
11 2020-08-26 B        Y             1
12 2020-12-06 B        X             0
13 2020-01-31 C        X             3
14 2020-09-19 C        X             2
15 2020-10-13 C        X             1
16 2020-11-11 C        X             0
17 2020-12-26 C        Y             0

I think this is just an issue of not even knowing what words to search, so I'm sure there's something out there that would tell me what to do if I could figure out the right search criteria. I appreciate anyone pointing me in the right direction.

Thanks!

CodePudding user response：

Here is one option with tidyverse - arrange by 'Customer', 'Date', then grouped by 'Customer', replace the vector of NA elements, where the 'Product' is 'X' with reverse sequence of the count of 'X' values, then we either use tidyr::fill or can use zoo::na.locf0 to fill the NA elements with previous non-NA values

library(dplyr)
df %>% 
  arrange(Customer, Date) %>% 
  group_by(Customer) %>% 
  mutate(new = zoo::na.locf0(replace(rep(NA_real_, n()), 
      Product == "X", rev(seq_len(sum(Product == "X")))-1))) %>%
  ungroup

-output

# A tibble: 17 × 4
   Date       Customer Product   new
   <date>     <chr>    <chr>   <dbl>
 1 2020-05-18 A        X           0
 2 2020-02-10 B        X           5
 3 2020-02-12 B        Y           5
 4 2020-03-04 B        Z           5
 5 2020-03-29 B        X           4
 6 2020-04-08 B        X           3
 7 2020-04-30 B        X           2
 8 2020-05-13 B        X           1
 9 2020-05-23 B        Y           1
10 2020-07-02 B        Y           1
11 2020-08-26 B        Y           1
12 2020-12-06 B        X           0
13 2020-01-31 C        X           3
14 2020-09-19 C        X           2
15 2020-10-13 C        X           1
16 2020-11-11 C        X           0
17 2020-12-26 C        Y           0

Similar option can be done with data.table

library(data.table)
setDT(df)[order(Customer, Date)][Product == "X", 
   nSubsqx := rev(seq_len(.N)) - 1, Customer][, 
      nSubsqx := nafill(nSubsqx, "locf"), Customer][]

-output

ndex: <Product>
          Date Customer Product nSubsqx
        <Date>   <char>  <char>   <num>
 1: 2020-05-18        A       X       0
 2: 2020-02-10        B       X       5
 3: 2020-02-12        B       Y       5
 4: 2020-03-04        B       Z       5
 5: 2020-03-29        B       X       4
 6: 2020-04-08        B       X       3
 7: 2020-04-30        B       X       2
 8: 2020-05-13        B       X       1
 9: 2020-05-23        B       Y       1
10: 2020-07-02        B       Y       1
11: 2020-08-26        B       Y       1
12: 2020-12-06        B       X       0
13: 2020-01-31        C       X       3
14: 2020-09-19        C       X       2
15: 2020-10-13        C       X       1
16: 2020-11-11        C       X       0
17: 2020-12-26        C       Y       0

CodePudding user response：

Here is a dplyr only solution:

The trick is to substract the grouping number of X (e.g. cumsum(Product=="X") from the sum of X (e.g. sum(Product=="X") in each Customer group:

library(dplyr)

  df %>%
    arrange(Customer, Date) %>%
    group_by(Customer) %>%
    mutate(nSubsqX1 = sum(Product=="X") - cumsum(Product=="X"))

   Date       Customer Product nSubsqX1
   <date>     <chr>    <chr>      <int>
 1 2020-05-18 A        X              0
 2 2020-02-10 B        X              5
 3 2020-02-12 B        Y              5
 4 2020-03-04 B        Z              5
 5 2020-03-29 B        X              4
 6 2020-04-08 B        X              3
 7 2020-04-30 B        X              2
 8 2020-05-13 B        X              1
 9 2020-05-23 B        Y              1
10 2020-07-02 B        Y              1
11 2020-08-26 B        Y              1
12 2020-12-06 B        X              0
13 2020-01-31 C        X              3
14 2020-09-19 C        X              2
15 2020-10-13 C        X              1
16 2020-11-11 C        X              0
17 2020-12-26 C        Y              0

CodePudding user response：

How about something like this:

library(data.table)
setDT(df)[order(Customer,Date)] %>% 
  .[Product=="X", nSubsqX:=.N-1:.N, by=.(Customer)] %>% 
  .[order(Customer,Date),nSubsqX:=zoo::na.locf(id)] %>% 
  .[]

Output:

          Date Customer Product    nSubsqX
        <Date>   <char>  <char> <int>
 1: 2020-05-18        A       X     0
 2: 2020-02-10        B       X     5
 3: 2020-02-12        B       Y     5
 4: 2020-03-04        B       Z     5
 5: 2020-03-29        B       X     4
 6: 2020-04-08        B       X     3
 7: 2020-04-30        B       X     2
 8: 2020-05-13        B       X     1
 9: 2020-05-23        B       Y     1
10: 2020-07-02        B       Y     1
11: 2020-08-26        B       Y     1
12: 2020-12-06        B       X     0
13: 2020-01-31        C       X     3
14: 2020-09-19        C       X     2
15: 2020-10-13        C       X     1
16: 2020-11-11        C       X     0
17: 2020-12-26        C       Y     0

data.table explanation:

use setDT() to set df to data.table
We order by Customer and then date
in i, we limit to Product = "X"
in j, we create (by group) the nSubsqX, by assigning the value as the number of rows in the group (which is .N) minus the row number of the group (which can be generated on the fly as a sequence, like 1:.N)
in by, we set the group columns; in this case we want to group by Customer
use zoo::na.locf to fill