Say I have the following data
:
date | name | rolename |
---|---|---|
2009-12-01 | John | helper |
2010-12-01 | John | helper |
2011-12-01 | John | senior helper |
2012-12-01 | John | manager |
2009-12-01 | Will | helper |
2010-12-01 | Will | senior helper |
2011-12-01 | Will | manager |
2012-12-01 | Will | senior manager |
I am trying to count the number of roles, based on the column rolename
for the person in the name
column, the person has worked to date. For example, for the above data, I want a fourth column which measures the number of positions a person has worked so far:
date | name | rolename | nopositions |
---|---|---|---|
2009-12-01 | John | helper | 1 |
2010-12-01 | John | helper | 1 |
2011-12-01 | John | senior helper | 2 |
2012-12-01 | John | manager | 3 |
2009-12-01 | Will | helper | 1 |
2010-12-01 | Will | senior helper | 2 |
2011-12-01 | Will | manager | 3 |
2012-12-01 | Will | senior manager | 4 |
My failed attempts:
#attempt 1
library(dplyr)
data %>%
group_by(name) %>%
mutate(nopositions = count(rolename))
#attempt2
library(runner)
data %>%
group_by(name) %>%
mutate(nopositions = runner(x = rolename,
k = inf,
idx = date,
f = function(x) length(x))
CodePudding user response:
Assuming that the order by date
is assured,
library(dplyr)
quux %>%
group_by(name) %>%
mutate(noposition = cummax(match(rolename, unique(rolename)))) %>%
ungroup()
# # A tibble: 8 × 4
# date name rolename noposition
# <chr> <chr> <chr> <int>
# 1 2009-12-01 John helper 1
# 2 2010-12-01 John helper 1
# 3 2011-12-01 John senior helper 2
# 4 2012-12-01 John manager 3
# 5 2009-12-01 Will helper 1
# 6 2010-12-01 Will senior helper 2
# 7 2011-12-01 Will manager 3
# 8 2012-12-01 Will senior manager 4
We might get away without cummax
, except that if a name
returns to a previous rolename
, its noposition
will decreases (revert to a previous value). However, we want to keep the most-recent max value.
This does assume that unique
preserves the natural order of the first-occurrences. If something goes amiss with this (I cannot think of something off-hand), we could do a window of words:
quux %>%
group_by(name) %>%
mutate(noposition = sapply(seq_along(rolename), \(i) length(unique(rolename[1:i])))) %>%
ungroup()
# # A tibble: 8 × 4
# date name rolename noposition
# <chr> <chr> <chr> <int>
# 1 2009-12-01 John helper 1
# 2 2010-12-01 John helper 1
# 3 2011-12-01 John senior helper 2
# 4 2012-12-01 John manager 3
# 5 2009-12-01 Will helper 1
# 6 2010-12-01 Will senior helper 2
# 7 2011-12-01 Will manager 3
# 8 2012-12-01 Will senior manager 4
This produces the same results here, and it will tend to perform more poorly with larger groups (as it is iterating a lot more). I offer it as an extension in case assumptions preclude the use of cummax(match(..))
.