I have an issue while working with a dataset at work. The dataset contains a list of transactions made on our website. However, when a customer returns a product it adds another row in the dataset containing a negative value. An example of how the data could look:
Person 1 $150
Person 1 -$150
Person 1 $150
Person 2 $100
Person 2 -$100
Person 3 $50
Person 3 $20
To use this data for analysis, I am looking to remove the negative value and one value from the same person that has the same amount. The desired output of a new dataframe would be:
Person 1 $150
Person 3 $50
Person 3 $20
Any tips on how to solve this issue?
CodePudding user response:
A possible hack if customers doesn't buy again before they return:
library(dplyr)
library(stringr)
df |>
group_by(person) |>
mutate(transaction = if_else(str_detect(amount, "-"), row_number()-1, row_number())) |>
group_by(person, transaction) |>
filter(n() == 1) |>
ungroup()
Output:
# A tibble: 3 × 3
person amount transaction
<chr> <chr> <dbl>
1 Person1 $150 3
2 Person3 $50 1
3 Person3 $20 2
CodePudding user response:
As mentioned in comments that way of identify transactions seems pretty unreliable. If you has no choice anyway, you must mark refunded transactions as follows:
data <- data.frame(person=c(1,1,1,2,2,3,3), tx=c(150,-150,150,100,-100,50,20))
s <- lapply(split(data, interaction(data$person, abs(data$tx))),\(x){
cnt <- with(x, min(length(tx[tx<0]), length(tx[tx>0])));
within(x, refund <- cumsum(tx<0) <= cnt & cumsum(tx>0) <= cnt ) })
unsplit(s, interaction(data$person, abs(data$tx)))
#> person tx refund
#> 1 1 150 TRUE
#> 2 1 -150 TRUE
#> 3 1 150 FALSE
#> 4 2 100 TRUE
#> 5 2 -100 TRUE
#> 6 3 50 FALSE
#> 7 3 20 FALSE