Home > Mobile >  Create a dummary variable based on lagged conditional match of two categorical variables in R
Create a dummary variable based on lagged conditional match of two categorical variables in R

Time:12-22

I am trying to create a dummy variable that is equal to 1 when the person is hired from outside the company (external hire) and equal to 0 when the person is hired from inside the company (promotion).

My data contains the annual date, company name, employee name, and role name. An example of my data is given below:

date company_name employee_name role_name
2013-01-01 A Zoe manager
2014-01-01 A Zoe senior manager
2015-01-01 A Zoe senior manager
2015-01-01 B Zoe director
2016-01-01 B Zoe director
2017-01-01 C Zoe director
2013-01-01 A Morgan manager
2014-01-01 A Morgan senior manager
2015-01-01 A Morgan director
2016-01-01 A Morgan director
2017-01-01 A Morgan director
2013-01-01 A John manager
2014-01-01 B John senior manager
2015-01-01 A John director
2016-01-01 A John director
2017-01-01 A John director

Based on the above data, I want to create a dummy variable outsider_dummy. Dummy variable is equal to 1 when the person is hired from outside the company (external hire) and equal to 0 when the person is hired from inside the company (promotion).

date company_name employee_name role_name outsider_dummy
2013-01-01 A Zoe manager 0
2014-01-01 A Zoe senior manager 0
2015-01-01 A Zoe senior manager 0
2015-01-01 B Zoe director 1
2016-01-01 B Zoe director 1
2017-01-01 C Zoe director 1
2013-01-01 A Morgan manager 0
2014-01-01 A Morgan senior manager 0
2015-01-01 A Morgan director 0
2016-01-01 A Morgan director 0
2017-01-01 A Morgan director 0
2013-01-01 A John manager 0
2014-01-01 B John senior manager 1
2015-01-01 A John director 1
2016-01-01 A John director 1
2017-01-01 A John director 1

Zoe: Zoe at companies B and C is an outsider since she joined companies B from A and then C from B.

Morgan: Morgan is not an outsider since Morgan has worked at company A throughout the record.

John: John is an outsider at company B since he joined company A. John is again an outsider to company A from 2015-2017 since he joined company B in 2014.

Any idea on how to achieve this?

CodePudding user response:

This was some challenge I wanted to try as I'm not that strong and still learning. I didn't succeed in doing it in base R and did it with help of tidyverse. This isn't optimal and I think you can refine the answer, but it seems to do the trick. The idea was to use an ordered factor to keep track on a promotion and the lag function in dplyr. Here is the code with some comments:

  library(tidyverse)
  donnees %>% 
  # To keep track of double lines of director in same company
  # And ordered factor for promotion check
  mutate(concat = paste(company_name, role_name, sep = "-"),
         role_name = factor(role_name, levels = c("manager", "senior manager", "director"), ordered = TRUE)) %>%
  # Set the first director to 1 if the previous company is not the same or employee is not director
  # Replace the first NA which is the first line of each employee with 0
  group_by(employee_name) %>% 
  mutate(outsider_dummy = as.numeric(role_name >= lag(role_name) & company_name != lag(company_name)),
         outsider_dummy = replace_na(outsider_dummy, 0)) %>% 
  # For double line of director at the same company, fill 0's with 1's
  group_by(employee_name, concat) %>% 
  mutate(outsider_dummy = first(outsider_dummy)) %>%
  ungroup() %>% 
  select(-concat)

I hope this will help you.

CodePudding user response:

One approach is to use cumall which is a cumulative "all" from dplyr. This will make the new dummy variable 0 if all rows up to the current row are the same as the first row for company_name. If not, then will be set at 1. This assumes the dates are in chronological order, oldest first.

library(tidyverse)

df %>%
  group_by(employee_name) %>%
  mutate(outsider_dummy =  !cumall(company_name == first(company_name)))

Output

   date       company_name employee_name role_name      outsider_dummy
   <chr>      <chr>        <chr>         <chr>                   <int>
 1 2013-01-01 A            Zoe           manager                     0
 2 2014-01-01 A            Zoe           senior_manager              0
 3 2015-01-01 A            Zoe           senior_manager              0
 4 2015-01-01 B            Zoe           director                    1
 5 2016-01-01 B            Zoe           director                    1
 6 2017-01-01 C            Zoe           director                    1
 7 2013-01-01 A            Morgan        manager                     0
 8 2014-01-01 A            Morgan        senior_manager              0
 9 2015-01-01 A            Morgan        director                    0
10 2016-01-01 A            Morgan        director                    0
11 2017-01-01 A            Morgan        director                    0
12 2013-01-01 A            John          manager                     0
13 2014-01-01 B            John          senior_manager              1
14 2015-01-01 A            John          director                    1
15 2016-01-01 A            John          director                    1
16 2017-01-01 A            John          director                    1
  • Related