I have the following df as example:
df <- data.frame(status = c(rep("egr", 5), rep("ing", 5)),
ua = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2),
fam = c(rep("fam1", 2), rep("fam2", 7), "fam3"),
spp = c(rep("spp1", 3), rep("spp2", 3), "rep4", "rep5", rep("spp6", 2)))
What I'm trying to do is summarise
the count of strings based on a group_by
of the ua
column and the differences between the names of fam
and spp
when comparing the two status
(ing, egr).
In other words, for each ua
, I want to count via summarise
the differences between ing
- egr
for each other column (spp
and fam
). I'm having an issue specifically assigning the status
column as the base for the differences, i.e., getting the names of spp
and fam
at each status
before summarizing. A setdiff
between fam
or spp
from each status
seems enough, but I'm failing at, again, assigning the status
before the summary.
EDIT:
Getting the different number of names in ing
minus egr
, an output might look like
output <- data.frame(ua = c(1, 2),
fam = c(-1, 2),
spp = c(1, 1))
Example rationale:
ua
1 has, considering both egr
and ing
, two names of fam
(fam1 and fam2). Since fam2 is shared between ing
and egr
, the diff is 0. But egr
has also fam1
, then the difference became -1.
Another example: ua
2 has no egr
, then the sum is simply 2 (fam2 and fam3).
Again: ua
1 has, considering both ing
and egr
, four spp
(spp1, spp2, rep4, rep5). Doing the ing
minus egr
, it must result in 1 because ing
has two unique names and (rep4 and rep5) and egr
one (spp1). Thus, the diff of ing
- egr
is 1 since spp2 is shared
I hope this is understandable, it's kinda tricky to explain.
CodePudding user response:
Similar to @arg0naut91's answer, but instead of pivoting the data, we complete
the data:
library(tidyverse)
df %>%
group_by(ua, status) %>% #for each ua and status
summarise(across(c(fam, spp), ~ length(unique(.x))), .groups = "drop") %>% #get the number of unique values (~ n_distinct(.x)) would also work
complete(ua, status, fill = list(fam = 0, spp = 0)) %>% #complete the data such that every ua has both status (but with 0's if the row was missing)
arrange(status) %>% #arrange the data so we get the `diff` in the correct order
group_by(ua) %>%
summarise(across(c(fam, spp), ~ diff(.x)))
Result:
# A tibble: 2 × 3
ua fam spp
<dbl> <int> <int>
1 1 -1 1
2 2 2 1
CodePudding user response:
That could work, although rather convoluted:
library(tidyverse)
df %>%
pivot_longer(fam:spp) %>%
distinct %>%
group_by(ua, name, value) %>%
filter(n() == 1L) %>%
group_by(ua, name) %>%
summarise(value = n_distinct(value[status == 'ing']) - n_distinct(value[status == 'egr'])) %>%
pivot_wider(id_cols = ua, names_from = name, values_from = value)
Output:
# A tibble: 2 x 3
# Groups: ua [2]
ua fam spp
<dbl> <int> <int>
1 1 -1 1
2 2 2 1