I have a data frame Segments
with a company identifier gvkey
, year Year
, and two columns with an indicator of the industry the company operates in SICS1
and SICS2
. I would like to aggregate the data as such that I receive the count of unique industry identicators per company/year combination.
My data frame looks as follows (indicative):
gvkey Year SICS1 SICS2
1 1209 2017 3569 3533
2 1209 2017 2813 3569
3 1209 2017 2813 3569
4 1209 2018 2813 3569
5 1209 2018 2813 7280
6 1209 2018 1908 3569
7 1209 2018 1412 3569
8 1209 2018 3569 3200
9 1503 2017 3569 3533
10 1503 2017 2813 3569
11 1503 2018 2813 3569
12 1503 2018 3569 3533
13 1503 2018 2813 3569
My desired output is should be somewhat like:
gvkey Year n_unique
1209 2017 3
1209 2018 6
1503 2017 3
1503 2018 3
What I tried to so far only gave me the unique values per column:
Segments %&%
group_by(gvkey, Year) %&%
summarize(across(SICS1:SICS2, n_distinct))
gvkey Year SICS1 SICS2
1209 2017 2 2
1209 2018 4 3
1503 2017 2 2
1503 2018 2 2
Simply summing up the unique values from SICS1
and SICS2
will not do the trick unfortunately, because it would result in counting some industry identifiers multiple times.
CodePudding user response:
Thanks for the answers! I was making it way too complicated. Your answers made me think differently about it, although it was not exactly what I was looking for. My apologies if I was not clear enough. The following answered my problem:
Segments.1 <- Segments[c("gvkey", "Year", "SICS1")]
Segments.1 <- Segments.1 %>% rename(
SICS = SICS1
)
Segments.2 <- Segments[c("gvkey", "Year", "SICS2")]
Segments.2 <- Segments.2 %>% rename(
SICS = SICS2
)
Segments.n <- rbind(Segments.1, Segments.2)
Diversification <- Segments.n %>%
group_by(gvkey, Year) %>%
summarize(Diversification = n_distinct(SICS))
CodePudding user response:
In summarise()
, you could use across()
to select multiple columns, unlist them to vectors and count the numbers of unique values by groups.
library(dplyr)
df %>%
group_by(gvkey, Year) %>%
summarise(n_unique = n_distinct(unlist(across(SICS1:SICS2)))) %>%
ungroup()
# # A tibble: 4 × 3
# gvkey Year n_unique
# <int> <int> <int>
# 1 1209 2017 3
# 2 1209 2018 6
# 3 1503 2017 3
# 4 1503 2018 3
Another way is that you need to stack SICS1
and SICS2
together first, and then you could count the number of unique values.
df %>%
tidyr::pivot_longer(SICS1:SICS2) %>%
group_by(gvkey, Year) %>%
summarise(n_unique = n_distinct(value)) %>%
ungroup()