I have a dataset with over 845,000 observations. Here you can see a bit of the data, just to clarify what we are talking about:
COMPANY DIRECTOR YEAR
A Mark 2005
A Steven 2005
A Mark 2006
A Steven 2006
B John 2006
B Elias 2006
B Vince 2006
B John 2007
B Elias 2007
B Vince 2007
C Peter 2005
C Josh 2005
C Adam 2005
C Evan 2005
C Peter 2006
C Josh 2006
C Adam 2006
C Evan 2006
C Peter 2007
C Josh 2007
C Adam 2007
C Evan 2007
Now I want to create a fourth column that counts the number of directors for every company in every year. So, you get the following:
COMPANY DIRECTOR YEAR NUMBERDIRECTORS
A Mark 2005 2
A Steven 2005 2
A Mark 2006 2
A Steven 2006 2
B John 2006 3
B Elias 2006 3
B Vince 2006 3
B John 2007 3
B Elias 2007 3
B Vince 2007 3
C Peter 2005 4
C Josh 2005 4
C Adam 2005 4
C Evan 2005 4
C Peter 2006 4
C Josh 2006 4
C Adam 2006 4
C Evan 2006 4
C Peter 2007 4
C Josh 2007 4
C Adam 2007 4
C Evan 2007 4
I think I have to use something with a double condition, that R only sums the amount of directors if both the company and the year are the same at all the directors. But I don't know how to manage that.
CodePudding user response:
There are many ways you can do this, but using base R aggregate() is simple and requires no additional dependencies
x <- read.table(header = TRUE,
text = "COMPANY DIRECTOR YEAR
A Mark 2005
A Steven 2005
A Mark 2006
A Steven 2006
B John 2006
B Elias 2006
B Vince 2006
B John 2007
B Elias 2007
B Vince 2007
C Peter 2005
C Josh 2005
C Adam 2005
C Evan 2005
C Peter 2006
C Josh 2006
C Adam 2006
C Evan 2006
C Peter 2007
C Josh 2007
C Adam 2007
C Evan 2007"
)
aggregate(x$DIRECTOR, by = list(Company=x$COMPANY, Year=x$YEAR), function(y) length(unique(y)))
#> Company Year x
#> 1 A 2005 2
#> 2 C 2005 4
#> 3 A 2006 2
#> 4 B 2006 3
#> 5 C 2006 4
#> 6 B 2007 3
#> 7 C 2007 4
CodePudding user response:
library(tidyverse)
df %>%
group_by(COMPANY) %>%
mutate(NUMBERDIRECTORS = n_distinct(DIRECTOR))
# A tibble: 22 x 4
# Groups: COMPANY [3]
COMPANY DIRECTOR YEAR NUMBERDIRECTORS
<chr> <chr> <dbl> <int>
1 A Mark 2005 2
2 A Steven 2005 2
3 A Mark 2006 2
4 A Steven 2006 2
5 B John 2006 3
6 B Elias 2006 3
7 B Vince 2006 3
8 B John 2007 3
9 B Elias 2007 3
10 B Vince 2007 3
# ... with 12 more rows
# i Use `print(n = ...)` to see more rows