I have a dataset with financial data. Sometimes, a product gets refunded, resulting in a negative count of the product (so the money gets returned). I want to conditionally filter these rows out of the dataset. Example:
library(tidyverse)
set.seed(1)
df <- tibble(
count = sample(c(-1,1),80,replace = TRUE,prob=c(.2,.8)),
id = rep(1:4,20)
)
df %>%
group_by(id) %>%
summarize(total = sum(count))
# A tibble: 4 x 2
id total
<int> <dbl>
1 1 10
2 2 14
3 3 16
4 4 10
id = 1 has 15 positive counts and 5 negatives. (15 - 5= 10)
. I want to keep 10 values in df with id = 1 with the positive values.
id = 2 has 17 positive counts and 3 negatives. (17- 3 = 14)
. I want to keep 14 values in df with id = 2 with the positive values.
In the end, this condition should be True nrow(df) == sum(df$count)
Unfortunately, a filtering join such as anti_join()
will remove all the rows. For some reason I cannot think of another option to filter the tibble.
Thanks for helping me!
CodePudding user response:
You can "uncount" using the total
column to get the number of repeats of each row.
df %>%
group_by(id) %>%
summarize(total = sum(count)) %>%
uncount(total) %>%
mutate(count = 1)
#> # A tibble: 50 x 2
#> id count
#> <int> <dbl>
#> 1 1 1
#> 2 1 1
#> 3 1 1
#> 4 1 1
#> 5 1 1
#> 6 1 1
#> 7 1 1
#> 8 1 1
#> 9 1 1
#> 10 1 1
#> # ... with 40 more rows
Created on 2022-10-21 with reprex v2.0.2