Count a categorical variable based on a group over time (the date column)

Say I have the following data:

date	name	rolename
2009-12-01	John	helper
2010-12-01	John	helper
2011-12-01	John	senior helper
2012-12-01	John	manager
2009-12-01	Will	helper
2010-12-01	Will	senior helper
2011-12-01	Will	manager
2012-12-01	Will	senior manager

I am trying to count the number of distinct roles (the rolename column) that each person (the name column) has held to date. For example, for the above data, I want a fourth column that gives the number of positions a person has held so far:

date	name	rolename	nopositions
2009-12-01	John	helper	1
2010-12-01	John	helper	1
2011-12-01	John	senior helper	2
2012-12-01	John	manager	3
2009-12-01	Will	helper	1
2010-12-01	Will	senior helper	2
2011-12-01	Will	manager	3
2012-12-01	Will	senior manager	4

My failed attempts:

#attempt 1
library(dplyr)

data %>%
group_by(name) %>%
mutate(nopositions = count(rolename))

#attempt2
library(runner)

data %>%
group_by(name) %>%
mutate(nopositions = runner(x = rolename,
                            k = inf,
                            idx = date,
                            f = function(x) length(x))

Answer:

Assuming the rows are already sorted by date within each name:

library(dplyr)
quux %>%
  group_by(name) %>%
  mutate(noposition = cummax(match(rolename, unique(rolename)))) %>%
  ungroup()
# # A tibble: 8 × 4
#   date       name  rolename       noposition
#   <chr>      <chr> <chr>               <int>
# 1 2009-12-01 John  helper                  1
# 2 2010-12-01 John  helper                  1
# 3 2011-12-01 John  senior helper           2
# 4 2012-12-01 John  manager                 3
# 5 2009-12-01 Will  helper                  1
# 6 2010-12-01 Will  senior helper           2
# 7 2011-12-01 Will  manager                 3
# 8 2012-12-01 Will  senior manager          4
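
If the ordering by date is not assured, one option is to sort before grouping. The following is a minimal sketch under that assumption; the shuffled single-person data frame is made up for illustration, and it relies on ISO-formatted date strings sorting correctly as text.

```r
library(dplyr)

# Hypothetical shuffled subset of the example data (John only)
quux <- data.frame(
  date     = c("2011-12-01", "2009-12-01", "2012-12-01", "2010-12-01"),
  name     = "John",
  rolename = c("senior helper", "helper", "manager", "helper")
)

res <- quux %>%
  arrange(name, date) %>%   # make the date order explicit before counting
  group_by(name) %>%
  mutate(noposition = cummax(match(rolename, unique(rolename)))) %>%
  ungroup()

res$noposition
# [1] 1 1 2 3
```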

We might get away without cummax, except that if a person returns to a previous rolename, match alone would make noposition decrease (revert to an earlier value). We want to keep the running maximum instead.
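
To see the effect of cummax, here is a small base R sketch on a made-up role history (not part of the question's data) in which the person goes back to an earlier role:

```r
# Hypothetical history: the person returns to "helper" in the third period
rolename <- c("helper", "senior helper", "helper", "manager")

# Position of each role among the roles in order of first appearance
idx <- match(rolename, unique(rolename))
idx
# [1] 1 2 1 3

# The running maximum keeps the count from dropping back to 1
cummax(idx)
# [1] 1 2 2 3
```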

This does assume that unique preserves the order of first occurrences. If something goes amiss with this (I cannot think of a case off-hand), we could instead count the unique values over an expanding window:

quux %>%
  group_by(name) %>%
  mutate(noposition = sapply(seq_along(rolename), \(i) length(unique(rolename[1:i])))) %>%
  ungroup()
# # A tibble: 8 × 4
#   date       name  rolename       noposition
#   <chr>      <chr> <chr>               <int>
# 1 2009-12-01 John  helper                  1
# 2 2010-12-01 John  helper                  1
# 3 2011-12-01 John  senior helper           2
# 4 2012-12-01 John  manager                 3
# 5 2009-12-01 Will  helper                  1
# 6 2010-12-01 Will  senior helper           2
# 7 2011-12-01 Will  manager                 3
# 8 2012-12-01 Will  senior manager          4

This produces the same results here, but it will tend to perform more poorly with larger groups, since it recomputes unique on every prefix of the group. I offer it as an alternative in case assumptions preclude the use of cummax(match(..)).
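
As a quick sanity check, the two approaches can be compared on a single group in base R. The sketch below uses a made-up role vector, and the third variant, cumsum(!duplicated(...)), is not from the answer above; it is another common idiom for a running distinct count, included here only as a possible single-pass alternative.

```r
# Hypothetical role history for one person, with repeats and a return
rolename <- c("helper", "senior helper", "helper", "manager", "manager")

# Approach 1: index among first occurrences, then running maximum
via_match <- cummax(match(rolename, unique(rolename)))

# Approach 2: expanding window, recomputing unique() on every prefix
via_window <- sapply(seq_along(rolename), \(i) length(unique(rolename[1:i])))

# Swapped-in alternative: add 1 each time a value appears for the first time
via_dup <- cumsum(!duplicated(rolename))

via_match
# [1] 1 2 2 3 3
identical(via_match, via_window) && identical(via_match, via_dup)
# [1] TRUE
```

Inside the grouped mutate, the third variant would be written as noposition = cumsum(!duplicated(rolename)).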