Summarise names using a column value as a pre-filter-CodePudding

I have the following df as example:

df <- data.frame(status = c(rep("egr", 5), rep("ing", 5)),
                 ua = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2),
                 fam = c(rep("fam1", 2), rep("fam2", 7), "fam3"),
                 spp = c(rep("spp1", 3), rep("spp2", 3), "rep4", "rep5", rep("spp6", 2)))

What I'm trying to do is summarise the count of strings based on a group_by of the ua column and the differences between the names of fam and spp when comparing the two status (ing, egr).

In other words, for each ua, I want to count via summarise the differences between ing - egr for each other column (spp and fam). I'm having an issue specifically assigning the status column as the base for the differences, i.e., getting the names of spp and fam at each status before summarizing. A setdiff between fam or spp from each status seems enough, but I'm failing at, again, assigning the status before the summary.

EDIT:

Getting the different number of names in ing minus egr, an output might look like

output <- data.frame(ua = c(1, 2),
                     fam = c(-1, 2),
                     spp = c(1, 1))

Example rationale:

ua 1 has, considering both egr and ing, two names of fam (fam1 and fam2). Since fam2 is shared between ing and egr, the diff is 0. But egr has also fam1, then the difference became -1.

Another example: ua 2 has no egr, then the sum is simply 2 (fam2 and fam3).

Again: ua 1 has, considering both ing and egr, four spp (spp1, spp2, rep4, rep5). Doing the ing minus egr, it must result in 1 because ing has two unique names and (rep4 and rep5) and egr one (spp1). Thus, the diff of ing - egr is 1 since spp2 is shared

I hope this is understandable, it's kinda tricky to explain.

CodePudding user response：

Similar to @arg0naut91's answer, but instead of pivoting the data, we complete the data:

library(tidyverse)
df %>%
  group_by(ua, status) %>% #for each ua and status
  summarise(across(c(fam, spp), ~ length(unique(.x))), .groups = "drop") %>% #get the number of unique values (~ n_distinct(.x)) would also work
  complete(ua, status, fill = list(fam = 0, spp = 0)) %>% #complete the data such that every ua has both status (but with 0's if the row was missing)
  arrange(status) %>% #arrange the data so we get the `diff` in the correct order
  group_by(ua) %>%
  summarise(across(c(fam, spp), ~ diff(.x)))

Result:

# A tibble: 2 × 3
     ua   fam   spp
  <dbl> <int> <int>
1     1    -1     1
2     2     2     1

CodePudding user response：

That could work, although rather convoluted:

library(tidyverse)

df %>%
  pivot_longer(fam:spp) %>%
  distinct %>%
  group_by(ua, name, value) %>%
  filter(n() == 1L) %>%
  group_by(ua, name) %>%
  summarise(value = n_distinct(value[status == 'ing']) - n_distinct(value[status == 'egr'])) %>%
  pivot_wider(id_cols = ua, names_from = name, values_from = value)

Output:

# A tibble: 2 x 3
# Groups:   ua [2]
     ua   fam   spp
  <dbl> <int> <int>
1     1    -1     1
2     2     2     1