Home > Software design >  Count unique values over two columns per group
Count unique values over two columns per group

Time:06-13

I have a data frame Segments with a company identifier gvkey, year Year, and two columns with an indicator of the industry the company operates in SICS1 and SICS2. I would like to aggregate the data as such that I receive the count of unique industry identicators per company/year combination.

My data frame looks as follows (indicative):

    gvkey Year SICS1 SICS2
1    1209 2017  3569  3533
2    1209 2017  2813  3569
3    1209 2017  2813  3569
4    1209 2018  2813  3569
5    1209 2018  2813  7280
6    1209 2018  1908  3569
7    1209 2018  1412  3569
8    1209 2018  3569  3200
9    1503 2017  3569  3533
10   1503 2017  2813  3569
11   1503 2018  2813  3569
12   1503 2018  3569  3533
13   1503 2018  2813  3569

My desired output is should be somewhat like:

gvkey Year n_unique
1209  2017    3
1209  2018    6
1503  2017    3
1503  2018    3

What I tried to so far only gave me the unique values per column:

Segments %&%
group_by(gvkey, Year) %&%
summarize(across(SICS1:SICS2, n_distinct))

gvkey Year SICS1 SICS2
1209  2017   2     2
1209  2018   4     3
1503  2017   2     2
1503  2018   2     2

Simply summing up the unique values from SICS1 and SICS2 will not do the trick unfortunately, because it would result in counting some industry identifiers multiple times.

CodePudding user response:

Thanks for the answers! I was making it way too complicated. Your answers made me think differently about it, although it was not exactly what I was looking for. My apologies if I was not clear enough. The following answered my problem:

Segments.1 <- Segments[c("gvkey", "Year", "SICS1")]
Segments.1 <- Segments.1 %>% rename(
  SICS = SICS1
)
Segments.2 <- Segments[c("gvkey", "Year", "SICS2")]
Segments.2 <- Segments.2 %>% rename(
  SICS  = SICS2
)
Segments.n <- rbind(Segments.1, Segments.2)
Diversification <- Segments.n %>% 
  group_by(gvkey, Year) %>% 
  summarize(Diversification = n_distinct(SICS))

CodePudding user response:

In summarise(), you could use across() to select multiple columns, unlist them to vectors and count the numbers of unique values by groups.

library(dplyr)

df %>%
  group_by(gvkey, Year) %>%
  summarise(n_unique = n_distinct(unlist(across(SICS1:SICS2)))) %>%
  ungroup()

# # A tibble: 4 × 3
#   gvkey  Year n_unique
#   <int> <int>    <int>
# 1  1209  2017        3
# 2  1209  2018        6
# 3  1503  2017        3
# 4  1503  2018        3

Another way is that you need to stack SICS1 and SICS2 together first, and then you could count the number of unique values.

df %>%
  tidyr::pivot_longer(SICS1:SICS2) %>%
  group_by(gvkey, Year) %>%
  summarise(n_unique = n_distinct(value)) %>%
  ungroup()
  • Related