Home > Software engineering >  R - Count unique values over two columns per group
R - Count unique values over two columns per group

Time:06-12

I have a dataframe Segments with a company identifier gvkey, year Year, and two columns with an indicator of the industry the company operates in SICS1 and SICS2. I would like to aggregate the data as such that I receive the count of unique industry identicators per company/year combination.

My data frame looks as follows (indicative):

    gvkey Year SICS1 SICS2
1    1209 2017  3569  3533
2    1209 2017  2813  3569
3    1209 2017  2813  3569
4    1209 2018  2813  3569
5    1209 2018  2813  7280
6    1209 2018  1908  3569
7    1209 2018  1412  3569
8    1209 2018  3569  3200
9    1503 2017  3569  3533
10   1503 2017  2813  3569
11   1503 2018  2813  3569
12   1503 2018  3569  3533
13   1503 2018  2813  3569

My desired output is should be somewhat like:

gvkey Year n_unique
1209  2017    3
1209  2018    6
1503  2017    3
1503  2018    3

What I tried to so far only gave me the unique values per column:

Segments %&%
group_by(gvkey, Year) %&%
summarize(across(SICS1:SICS2, n_distinct))

gvkey Year SICS1 SICS2
1209  2017   2     2
1209  2018   4     3
1503  2017   2     2
1503  2018   2     2

Simply summing up the unique values from SICS1 and SICS2 will not do the trick unfortunately, because it would result in counting some industry identifiers multiple times.

P.S. This is my first question on this platform. Please let me know how I can improve asking questions and thanks in advance for your help!

CodePudding user response:

You can filter the duplicated rows and then use summarize and count with group_by like this:

library(dplyr)
Segments %>%
  filter(complete.cases(.) & !duplicated(.)) %>% 
  group_by(gvkey, Year) %>%
  summarize(n_unique = n())

Output:

# A tibble: 4 × 3
# Groups:   gvkey [2]
  gvkey  Year n_unique
  <int> <int>    <int>
1  1209  2017        2
2  1209  2018        5
3  1503  2017        2
4  1503  2018        2

CodePudding user response:

You should use dput to make a reproducible example as shown below. This also shows an alternative use of mutate and unique to get the desired result.

library(dplyr)

# the output from dput(df) is used to make reproducible examples
df <- structure(
    list(gvkey = c(1209L, 1209L, 1209L, 1209L, 1209L, 1209L, 1209L, 1209L, 1503L, 1503L, 1503L, 1503L, 1503L), Year = c(2017L, 
    2017L, 2017L, 2018L, 2018L, 2018L, 2018L, 2018L, 2017L, 2017L, 
    2018L, 2018L, 2018L), SICS1 = c(3569L, 2813L, 2813L, 2813L, 2813L, 
    1908L, 1412L, 3569L, 3569L, 2813L, 2813L, 3569L, 2813L), SICS2 = c(3533L, 
    3569L, 3569L, 3569L, 7280L, 3569L, 3569L, 3200L, 3533L, 3569L, 
    3569L, 3533L, 3569L)), class = "data.frame", row.names = c(NA, 
    -13L))

summary <- 
    df %>% 
    mutate(combination=paste(SICS1, SICS2)) %>% 
    group_by(gvkey, Year) %>% 
    unique() %>%
    summarize(count = n())

summary
# A tibble: 4 × 3
# Groups:   gvkey [2]
  gvkey  Year count
  <int> <int> <int>
1  1209  2017     2
2  1209  2018     5
3  1503  2017     2
4  1503  2018     2
  • Related