I have a dataset in which I have customer id's, order id's and product id's, like so
customerID | orderID | Product ID |
---|---|---|
123 | 1 | A |
123 | 1 | B |
456 | 2 | C |
123 | 3 | C |
789 | 4 | D |
789 | 4 | D |
789 | 4 | E |
123 | 5 | C |
789 | 6 | D |
I'm looking for a way to only keep the product id rows for every customers first order, like so:
customerID | orderID | Product ID |
---|---|---|
123 | 1 | A |
123 | 1 | B |
456 | 2 | C |
789 | 4 | D |
789 | 4 | D |
789 | 4 | E |
I have tried
first_order <-
df %>%
group_by(customerid) %>%
filter(row_number()==1)
However, this gives me only the first row of a customerid, meaning it only returns one product id, where it can be possible for a order to include more than one product id.
I guess I should make changes to the filter(row_number()==1)
statement, but I'm not sure how.
Any help is much appreciated! Thanks in advance
CodePudding user response:
Use first
:
df %>%
group_by(customerID) %>%
filter(orderID == first(orderID))
In this case, using slice_min
also works, assuming that the first order has the lowest number.
df %>%
group_by(customerID) %>%
slice_min(orderID)
CodePudding user response:
In base R
,
subset(df1, orderID %in% orderID[!duplicated(customerID)])
-output
customerID orderID ProductID
1 123 1 A
2 123 1 B
3 456 2 C
5 789 4 D
6 789 4 D
7 789 4 E