Creating a formula with a double condition-CodePudding

I have a dataset with over 845,000 observations. Here you can see a bit of the data, just to clarify what we are talking about:


COMPANY     DIRECTOR      YEAR      
A           Mark          2005      
A           Steven        2005      
A           Mark          2006      
A           Steven        2006      
B           John          2006      
B           Elias         2006      
B           Vince         2006      
B           John          2007      
B           Elias         2007      
B           Vince         2007      
C           Peter         2005      
C           Josh          2005      
C           Adam          2005      
C           Evan          2005      
C           Peter         2006      
C           Josh          2006      
C           Adam          2006      
C           Evan          2006      
C           Peter         2007      
C           Josh          2007      
C           Adam          2007      
C           Evan          2007

Now I want to create a fourth column that counts the number of directors for every company in every year. So, you get the following:

COMPANY     DIRECTOR      YEAR      NUMBERDIRECTORS
A           Mark          2005      2
A           Steven        2005      2
A           Mark          2006      2
A           Steven        2006      2
B           John          2006      3
B           Elias         2006      3
B           Vince         2006      3
B           John          2007      3
B           Elias         2007      3
B           Vince         2007      3
C           Peter         2005      4
C           Josh          2005      4
C           Adam          2005      4
C           Evan          2005      4
C           Peter         2006      4
C           Josh          2006      4
C           Adam          2006      4
C           Evan          2006      4
C           Peter         2007      4
C           Josh          2007      4
C           Adam          2007      4
C           Evan          2007      4

I think I have to use something with a double condition, that R only sums the amount of directors if both the company and the year are the same at all the directors. But I don't know how to manage that.

CodePudding user response：

There are many ways you can do this, but using base R aggregate() is simple and requires no additional dependencies

x <- read.table(header = TRUE,
  text = "COMPANY     DIRECTOR      YEAR
A           Mark          2005
A           Steven        2005
A           Mark          2006
A           Steven        2006
B           John          2006
B           Elias         2006
B           Vince         2006
B           John          2007
B           Elias         2007
B           Vince         2007
C           Peter         2005
C           Josh          2005
C           Adam          2005
C           Evan          2005
C           Peter         2006
C           Josh          2006
C           Adam          2006
C           Evan          2006
C           Peter         2007
C           Josh          2007
C           Adam          2007
C           Evan          2007"
)
aggregate(x$DIRECTOR, by = list(Company=x$COMPANY, Year=x$YEAR), function(y) length(unique(y)))
#>   Company Year x
#> 1       A 2005 2
#> 2       C 2005 4
#> 3       A 2006 2
#> 4       B 2006 3
#> 5       C 2006 4
#> 6       B 2007 3
#> 7       C 2007 4

CodePudding user response：

library(tidyverse)

df %>%  
  group_by(COMPANY) %>% 
  mutate(NUMBERDIRECTORS = n_distinct(DIRECTOR))

# A tibble: 22 x 4
# Groups:   COMPANY [3]
   COMPANY DIRECTOR  YEAR NUMBERDIRECTORS
   <chr>   <chr>    <dbl>           <int>
 1 A       Mark      2005               2
 2 A       Steven    2005               2
 3 A       Mark      2006               2
 4 A       Steven    2006               2
 5 B       John      2006               3
 6 B       Elias     2006               3
 7 B       Vince     2006               3
 8 B       John      2007               3
 9 B       Elias     2007               3
10 B       Vince     2007               3
# ... with 12 more rows
# i Use `print(n = ...)` to see more rows