I have a dataframe Segments
with a company identifier gvkey
, year Year
, and two columns with an indicator of the industry the company operates in SICS1
and SICS2
. I would like to aggregate the data as such that I receive the count of unique industry identicators per company/year combination.
My data frame looks as follows (indicative):
gvkey Year SICS1 SICS2
1 1209 2017 3569 3533
2 1209 2017 2813 3569
3 1209 2017 2813 3569
4 1209 2018 2813 3569
5 1209 2018 2813 7280
6 1209 2018 1908 3569
7 1209 2018 1412 3569
8 1209 2018 3569 3200
9 1503 2017 3569 3533
10 1503 2017 2813 3569
11 1503 2018 2813 3569
12 1503 2018 3569 3533
13 1503 2018 2813 3569
My desired output is should be somewhat like:
gvkey Year n_unique
1209 2017 3
1209 2018 6
1503 2017 3
1503 2018 3
What I tried to so far only gave me the unique values per column:
Segments %&%
group_by(gvkey, Year) %&%
summarize(across(SICS1:SICS2, n_distinct))
gvkey Year SICS1 SICS2
1209 2017 2 2
1209 2018 4 3
1503 2017 2 2
1503 2018 2 2
Simply summing up the unique values from SICS1
and SICS2
will not do the trick unfortunately, because it would result in counting some industry identifiers multiple times.
P.S. This is my first question on this platform. Please let me know how I can improve asking questions and thanks in advance for your help!
CodePudding user response:
You can filter
the duplicated rows and then use summarize
and count
with group_by
like this:
library(dplyr)
Segments %>%
filter(complete.cases(.) & !duplicated(.)) %>%
group_by(gvkey, Year) %>%
summarize(n_unique = n())
Output:
# A tibble: 4 × 3
# Groups: gvkey [2]
gvkey Year n_unique
<int> <int> <int>
1 1209 2017 2
2 1209 2018 5
3 1503 2017 2
4 1503 2018 2
CodePudding user response:
You should use dput
to make a reproducible example as shown below. This also shows an alternative use of mutate
and unique
to get the desired result.
library(dplyr)
# the output from dput(df) is used to make reproducible examples
df <- structure(
list(gvkey = c(1209L, 1209L, 1209L, 1209L, 1209L, 1209L, 1209L, 1209L, 1503L, 1503L, 1503L, 1503L, 1503L), Year = c(2017L,
2017L, 2017L, 2018L, 2018L, 2018L, 2018L, 2018L, 2017L, 2017L,
2018L, 2018L, 2018L), SICS1 = c(3569L, 2813L, 2813L, 2813L, 2813L,
1908L, 1412L, 3569L, 3569L, 2813L, 2813L, 3569L, 2813L), SICS2 = c(3533L,
3569L, 3569L, 3569L, 7280L, 3569L, 3569L, 3200L, 3533L, 3569L,
3569L, 3533L, 3569L)), class = "data.frame", row.names = c(NA,
-13L))
summary <-
df %>%
mutate(combination=paste(SICS1, SICS2)) %>%
group_by(gvkey, Year) %>%
unique() %>%
summarize(count = n())
summary
# A tibble: 4 × 3
# Groups: gvkey [2]
gvkey Year count
<int> <int> <int>
1 1209 2017 2
2 1209 2018 5
3 1503 2017 2
4 1503 2018 2