Home > Back-end >  How to filter for criteria where rows are compared to one another in R
How to filter for criteria where rows are compared to one another in R

Time:04-14

I have a dataframe:

UserId <- c("A", "A", "A", "B", "B", "B")
SellerId <- c("X", "X", "Y", "Y", "Z", "Z")
Product <- c("ball", "ball", "ball", "ball", "doll", "doll")
SalesDate <- c("2022-01-01", "2022-01-01", "2022-01-02", "2022-01-04", "2022-01-06", "2022-01-07")

sales <- data.frame(UserId, SellerId, Product, SalesDate)

And I want to find sales for which:

  • the same user bought the same product twice from the same seller on the same day, but of course I need to do it on a larger scale.

I've been thinking for a long time how to even use one of these criteria and nothing comes to mind. The table I should be left with in this case is:

| UserId   | SellerId  | Product | SalesDate    |
| -------- | --------- | ------- | ------------ |
| A        | X         | ball    | 2022-01-01   |
| A        | X         | ball    | 2022-01-01   |

UserId is the same, seller is the same, the product is the same and salesdate is the same. The problem is that I don't look for specific users or specific products.

I would like to find all users who bought the same product twice (no matter what the product is - the list is long), the same with purchasedate (the date doesn't matter, it needs to be the same for the same user).

Do you have any ideas how to do even a part of the code?

CodePudding user response:

Using dplyr, you can group_by_all variables, and filter out anything that do not have more than 1 records.

library(dplyr)

sales %>% group_by_all() %>% filter(n() > 1)

# A tibble: 2 × 4
# Groups:   UserId, SellerId, Product, SalesDate [1]
  UserId SellerId Product SalesDate 
  <chr>  <chr>    <chr>   <chr>     
1 A      X        ball    2022-01-01
2 A      X        ball    2022-01-01

CodePudding user response:

Group by all and use filter. The difference to @benson23 1 is to use across:

library(dplyr)
sales %>% 
  group_by(across(everything())) %>%
  filter( n() > 1 )

or even as everything() is default:

sales %>% 
  group_by(across()) %>%
  filter( n() > 1 )

CodePudding user response:

Using add_count() will give you the number of each occurence.

sales %>%
  add_count(UserId, SellerId, Product,  SalesDate)

  UserId SellerId Product  SalesDate n
1      A        X    ball 2022-01-01 2
2      A        X    ball 2022-01-01 2
3      A        Y    ball 2022-01-02 1
4      B        Y    ball 2022-01-04 1
5      B        Z    doll 2022-01-06 1
6      B        Z    doll 2022-01-07 1

from there on you can filter for n == 2 or n > 1 depending on your question.

  • Related