R dplyr group summarize and filter when ID exists in multiple groups-CodePudding

I have a table with ID, Category and amount with a few thousand records.

data:

df1 <- data.frame(
  ID = c('V1', 'V1', 'V1', 'V3', 'V3', 'V3', 'V4', 'V5','V5','V5'),
  Category = c('a', 'a', 'a', 'a', 'b', 'b', 'a', 'b', 'c', 'c'),
  Amount   = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1))

Using dplyr I want to group by ID and Category, sum the total amount per group, then filter the results to only have IDs which exist in multiple category.

result:

ID  Category Amount_Sum
V3  a        1
V3  b        2
V5  b        1
V5  c        2

I have the following code which groups and sums, but missing how to filter when the ID is in multiple groups

code:

x <- df1 %>% 
       group_by(ID, Category) %>% 
       summarize(CNT = n(), amount = sum(Amount)) %>% 
       filter(????????)

CodePudding user response：

Using n_distinct on the Category should give you your desired result:

library(dplyr)

df1 %>% 
  group_by(ID, Category) %>% 
  summarize(CNT = n(), amount = sum(Amount)) %>% 
  filter(n_distinct(Category) > 1) %>%
  ungroup()

returns

# A tibble: 4 x 4
  ID    Category   CNT amount
  <chr> <chr>    <int>  <dbl>
1 V3    a            1      1
2 V3    b            2      2
3 V5    b            1      1
4 V5    c            2      2

CodePudding user response：

You can also use a combination of length and unique to filter as well:

library(dplyr)

df1 %>% 
  group_by(ID, Category) %>% 
  summarize(CNT = n(), amount = sum(Amount)) %>% 
  filter(length(unique(Category)) > 1)

Output

  ID    Category   CNT amount
  <chr> <chr>    <int>  <dbl>
1 V3    a            1      1
2 V3    b            2      2
3 V5    b            1      1
4 V5    c            2      2

Or here is a base R option using aggregate to do the summary, then using ave to do the filtering.

df1_output <-
  setNames(do.call(
    data.frame,
    aggregate(
      Amount ~ ID   Category,
      data = df1,
      FUN = function(x)
        c(CNT = length(x), amount = sum(x))
    )
  ), c(names(df1[1:2]), "CNT", "amount"))


df1_output[with(df1_output, ave(Category, ID, FUN = function(x) length(unique(x))) > 1),]