I want to take a set of data, group it by one column, order it by another, and then count how many subsequent instances of a certain event happen. For example, in the data below... I want to add a column, call it nSubsqX, that tells me at each row, for that customer, how many subsequent orders have Product "X". Row 1 should result in 3 because rows 13:15 all are Customer C, Product X; Row 9 should result in 0 since there are no subsequent orders by Customer A.
Date Customer Product
1 2020-01-31 C X
2 2020-02-10 B X
3 2020-02-12 B Y
4 2020-03-04 B Z
5 2020-03-29 B X
6 2020-04-08 B X
7 2020-04-30 B X
8 2020-05-13 B X
9 2020-05-18 A X
10 2020-05-23 B Y
11 2020-07-02 B Y
12 2020-08-26 B Y
13 2020-09-19 C X
14 2020-10-13 C X
15 2020-11-11 C X
16 2020-12-06 B X
17 2020-12-26 C Y
For the purpose of providing a Reprex, below is the code to create the data frame.
df = data.frame("Date" = as.Date(c("2020-01-31", "2020-02-10", "2020-02-12",
"2020-03-04", "2020-03-29", "2020-04-08", "2020-04-30", "2020-05-13", "2020-05-18",
"2020-05-23", "2020-07-02", "2020-08-26", "2020-09-19", "2020-10-13", "2020-11-11",
"2020-12-06", "2020-12-26")), "Customer" = c("C","B","B","B","B","B","B","B","A",
"B","B","B","C","C","C","B","C"), "Product" = c("X","X","Y","Z","X","X","X","X","X",
"Y","Y","Y","X","X","X","X","Y"))
I expect that I will need some sort of mutate function, but I can't quite get it right, I've tried:
df2 = df %>%
group_by(Customer) %>%
arrange(Customer, Date) %>%
mutate(
nSubsqX = length(Customer[which(Product == "X")]))
Which gives the TOTAL of times "X" shows up, but what I want is the number subsequently to this one. I also tried:
df2 = df %>%
group_by(Customer) %>%
arrange(Customer, Date) %>%
mutate(
nSubsqX = length(Customer[which(Product == "X" & Date > Date)]))
Which just returns 0 probably because Date > Date doesn't make any sense. I need a way to say Date > THIS Date. The solution I am trying to achieve is shown below:
Date Customer Product nSubsqX
1 2020-05-18 A X 0
2 2020-02-10 B X 5
3 2020-02-12 B Y 5
4 2020-03-04 B Z 5
5 2020-03-29 B X 4
6 2020-04-08 B X 3
7 2020-04-30 B X 2
8 2020-05-13 B X 1
9 2020-05-23 B Y 1
10 2020-07-02 B Y 1
11 2020-08-26 B Y 1
12 2020-12-06 B X 0
13 2020-01-31 C X 3
14 2020-09-19 C X 2
15 2020-10-13 C X 1
16 2020-11-11 C X 0
17 2020-12-26 C Y 0
I think this is just an issue of not even knowing what words to search, so I'm sure there's something out there that would tell me what to do if I could figure out the right search criteria. I appreciate anyone pointing me in the right direction.
Thanks!
CodePudding user response:
Here is one option with tidyverse
- arrange
by 'Customer', 'Date', then grouped by 'Customer', replace
the vector of NA
elements, where the 'Product' is 'X' with rev
erse sequence of the count of 'X' values, then we either use tidyr::fill
or can use zoo::na.locf0
to fill the NA elements with previous non-NA values
library(dplyr)
df %>%
arrange(Customer, Date) %>%
group_by(Customer) %>%
mutate(new = zoo::na.locf0(replace(rep(NA_real_, n()),
Product == "X", rev(seq_len(sum(Product == "X")))-1))) %>%
ungroup
-output
# A tibble: 17 × 4
Date Customer Product new
<date> <chr> <chr> <dbl>
1 2020-05-18 A X 0
2 2020-02-10 B X 5
3 2020-02-12 B Y 5
4 2020-03-04 B Z 5
5 2020-03-29 B X 4
6 2020-04-08 B X 3
7 2020-04-30 B X 2
8 2020-05-13 B X 1
9 2020-05-23 B Y 1
10 2020-07-02 B Y 1
11 2020-08-26 B Y 1
12 2020-12-06 B X 0
13 2020-01-31 C X 3
14 2020-09-19 C X 2
15 2020-10-13 C X 1
16 2020-11-11 C X 0
17 2020-12-26 C Y 0
Similar option can be done with data.table
library(data.table)
setDT(df)[order(Customer, Date)][Product == "X",
nSubsqx := rev(seq_len(.N)) - 1, Customer][,
nSubsqx := nafill(nSubsqx, "locf"), Customer][]
-output
ndex: <Product>
Date Customer Product nSubsqx
<Date> <char> <char> <num>
1: 2020-05-18 A X 0
2: 2020-02-10 B X 5
3: 2020-02-12 B Y 5
4: 2020-03-04 B Z 5
5: 2020-03-29 B X 4
6: 2020-04-08 B X 3
7: 2020-04-30 B X 2
8: 2020-05-13 B X 1
9: 2020-05-23 B Y 1
10: 2020-07-02 B Y 1
11: 2020-08-26 B Y 1
12: 2020-12-06 B X 0
13: 2020-01-31 C X 3
14: 2020-09-19 C X 2
15: 2020-10-13 C X 1
16: 2020-11-11 C X 0
17: 2020-12-26 C Y 0
CodePudding user response:
Here is a dplyr
only solution:
The trick is to substract the grouping number of X (e.g. cumsum(Product=="X")
from the sum of X (e.g. sum(Product=="X")
in each Customer
group:
library(dplyr)
df %>%
arrange(Customer, Date) %>%
group_by(Customer) %>%
mutate(nSubsqX1 = sum(Product=="X") - cumsum(Product=="X"))
Date Customer Product nSubsqX1
<date> <chr> <chr> <int>
1 2020-05-18 A X 0
2 2020-02-10 B X 5
3 2020-02-12 B Y 5
4 2020-03-04 B Z 5
5 2020-03-29 B X 4
6 2020-04-08 B X 3
7 2020-04-30 B X 2
8 2020-05-13 B X 1
9 2020-05-23 B Y 1
10 2020-07-02 B Y 1
11 2020-08-26 B Y 1
12 2020-12-06 B X 0
13 2020-01-31 C X 3
14 2020-09-19 C X 2
15 2020-10-13 C X 1
16 2020-11-11 C X 0
17 2020-12-26 C Y 0
CodePudding user response:
How about something like this:
library(data.table)
setDT(df)[order(Customer,Date)] %>%
.[Product=="X", nSubsqX:=.N-1:.N, by=.(Customer)] %>%
.[order(Customer,Date),nSubsqX:=zoo::na.locf(id)] %>%
.[]
Output:
Date Customer Product nSubsqX
<Date> <char> <char> <int>
1: 2020-05-18 A X 0
2: 2020-02-10 B X 5
3: 2020-02-12 B Y 5
4: 2020-03-04 B Z 5
5: 2020-03-29 B X 4
6: 2020-04-08 B X 3
7: 2020-04-30 B X 2
8: 2020-05-13 B X 1
9: 2020-05-23 B Y 1
10: 2020-07-02 B Y 1
11: 2020-08-26 B Y 1
12: 2020-12-06 B X 0
13: 2020-01-31 C X 3
14: 2020-09-19 C X 2
15: 2020-10-13 C X 1
16: 2020-11-11 C X 0
17: 2020-12-26 C Y 0
data.table
explanation:
- use
setDT()
to set df to data.table - We order by Customer and then date
- in
i
, we limit to Product = "X" - in
j
, we create (by group) thenSubsqX
, by assigning the value as the number of rows in the group (which is.N
) minus the row number of the group (which can be generated on the fly as a sequence, like1:.N
) - in
by
, we set the group columns; in this case we want to group byCustomer
- use zoo::na.locf to fill