Create a dummy variable for a group based on rolling values of a categorical variable-CodePudding

Say I have the following data:

date	name	rolename	firmname
2011-12-01	John	helper	A
2012-12-01	John	helper	A
2013-12-01	John	helper	A
2014-12-01	John	helper	B
2014-12-01	John	senior manager	C
2015-12-01	John	helper	B
2015-12-01	John	senior manager	C
2016-12-01	John	senior manager	C
2016-12-01	John	senior manager	D
2017-12-01	John	helper	E
2011-12-01	Will	senior manager	A
2012-12-01	Will	senior manager	A
2013-12-01	Will	senior manager	Z

I am trying to create a dummy variable for prior senior manager experience (dummy_sm_exp). That is, dummy_sm_exp equals 1 when the person has prior experience as a senior manager in some other firm, 0 otherwise. For instance, for the above data, a fifth column is created taking following values:

date	name	rolename	firmname	dummy_sm_exp
2011-12-01	John	helper	A	0
2012-12-01	John	helper	A	0
2013-12-01	John	helper	A	0
2014-12-01	John	helper	B	0
2014-12-01	John	senior manager	C	0
2015-12-01	John	helper	B	0
2015-12-01	John	senior manager	C	0
2016-12-01	John	senior manager	C	0
2016-12-01	John	senior manager	D	1
2017-12-01	John	helper	E	1
2011-12-01	Will	senior manager	A	0
2012-12-01	Will	senior manager	A	0
2013-12-01	Will	senior manager	Z	1

Note that the dummy becomes equal to 1 only when a person has prior senior manager work experience at another firm. Any hints? Thanks.

CodePudding user response：

I think this is the most robust way of testing whether each person has ever had management experience at a different company, i.e. by summarising by person and company before testing lagged experience:

library(tidyverse)


df <- read.table(
  text = "date  name    rolename    firmname
          2011-12-01    John    helper  A
          2012-12-01    John    helper  A
          2013-12-01    John    helper  A
          2014-12-01    John    helper  B
          2014-12-01    John    'senior manager'    C
          2015-12-01    John    helper  B
          2015-12-01    John    'senior manager'    C
          2016-12-01    John    'senior manager'    C
          2016-12-01    John    'senior manager'    D
          2017-12-01    John    helper  E
          2011-12-01    Will    'senior manager'    A
          2012-12-01    Will    'senior manager'    A
          2013-12-01    Will    'senior manager'    Z",
  header = TRUE
) |>
  as_tibble()

df |>
  group_by(name, firmname) |> 
  summarise(manager = any(rolename == "senior manager")) |> 
  mutate(prev_man = as.integer(lag(cumany(manager), default = FALSE))) |> 
  left_join(df, by = c("name", "firmname")) |> 
  select(date, name, rolename, firmname, prev_man)
#> `summarise()` has grouped output by 'name'. You can override using the
#> `.groups` argument.
#> # A tibble: 13 × 5
#> # Groups:   name [2]
#>    date       name  rolename       firmname prev_man
#>    <chr>      <chr> <chr>          <chr>       <int>
#>  1 2011-12-01 John  helper         A               0
#>  2 2012-12-01 John  helper         A               0
#>  3 2013-12-01 John  helper         A               0
#>  4 2014-12-01 John  helper         B               0
#>  5 2015-12-01 John  helper         B               0
#>  6 2014-12-01 John  senior manager C               0
#>  7 2015-12-01 John  senior manager C               0
#>  8 2016-12-01 John  senior manager C               0
#>  9 2016-12-01 John  senior manager D               1
#> 10 2017-12-01 John  helper         E               1
#> 11 2011-12-01 Will  senior manager A               0
#> 12 2012-12-01 Will  senior manager A               0
#> 13 2013-12-01 Will  senior manager Z               1

This will not distinguish between different senior manager roles at the same company. I'm not sure what John is doing in 2014 and 2015 - is he in both a helper and manager role in both years, so should get 0 in 2015 and 2016 as he hasn't changed company, or do you mean by this that he's switched between roles and should get 1? Will has not changed company in 2012 so gets a 0, but gets a 1 for 2013.