I am trying to create a dummy variable that is equal to 1 when the person is hired from outside the company (external hire) and equal to 0 when the person is hired from inside the company (promotion).
My data contains the annual date, company name, employee name, and role name. An example of my data is given below:
date | company_name | employee_name | role_name |
---|---|---|---|
2013-01-01 | A | Zoe | manager |
2014-01-01 | A | Zoe | senior manager |
2015-01-01 | A | Zoe | senior manager |
2015-01-01 | B | Zoe | director |
2016-01-01 | B | Zoe | director |
2017-01-01 | C | Zoe | director |
2013-01-01 | A | Morgan | manager |
2014-01-01 | A | Morgan | senior manager |
2015-01-01 | A | Morgan | director |
2016-01-01 | A | Morgan | director |
2017-01-01 | A | Morgan | director |
2013-01-01 | A | John | manager |
2014-01-01 | B | John | senior manager |
2015-01-01 | A | John | director |
2016-01-01 | A | John | director |
2017-01-01 | A | John | director |
Based on the above data, I want to create a dummy variable outsider_dummy
. Dummy variable is equal to 1 when the person is hired from outside the company (external hire) and equal to 0 when the person is hired from inside the company (promotion).
date | company_name | employee_name | role_name | outsider_dummy |
---|---|---|---|---|
2013-01-01 | A | Zoe | manager | 0 |
2014-01-01 | A | Zoe | senior manager | 0 |
2015-01-01 | A | Zoe | senior manager | 0 |
2015-01-01 | B | Zoe | director | 1 |
2016-01-01 | B | Zoe | director | 1 |
2017-01-01 | C | Zoe | director | 1 |
2013-01-01 | A | Morgan | manager | 0 |
2014-01-01 | A | Morgan | senior manager | 0 |
2015-01-01 | A | Morgan | director | 0 |
2016-01-01 | A | Morgan | director | 0 |
2017-01-01 | A | Morgan | director | 0 |
2013-01-01 | A | John | manager | 0 |
2014-01-01 | B | John | senior manager | 1 |
2015-01-01 | A | John | director | 1 |
2016-01-01 | A | John | director | 1 |
2017-01-01 | A | John | director | 1 |
Zoe: Zoe at companies B and C is an outsider since she joined companies B from A and then C from B.
Morgan: Morgan is not an outsider since Morgan has worked at company A throughout the record.
John: John is an outsider at company B since he joined company A. John is again an outsider to company A from 2015-2017 since he joined company B in 2014.
Any idea on how to achieve this?
CodePudding user response:
This was some challenge I wanted to try as I'm not that strong and still learning. I didn't succeed in doing it in base R and did it with help of tidyverse. This isn't optimal and I think you can refine the answer, but it seems to do the trick. The idea was to use an ordered factor to keep track on a promotion and the lag function in dplyr. Here is the code with some comments:
library(tidyverse)
donnees %>%
# To keep track of double lines of director in same company
# And ordered factor for promotion check
mutate(concat = paste(company_name, role_name, sep = "-"),
role_name = factor(role_name, levels = c("manager", "senior manager", "director"), ordered = TRUE)) %>%
# Set the first director to 1 if the previous company is not the same or employee is not director
# Replace the first NA which is the first line of each employee with 0
group_by(employee_name) %>%
mutate(outsider_dummy = as.numeric(role_name >= lag(role_name) & company_name != lag(company_name)),
outsider_dummy = replace_na(outsider_dummy, 0)) %>%
# For double line of director at the same company, fill 0's with 1's
group_by(employee_name, concat) %>%
mutate(outsider_dummy = first(outsider_dummy)) %>%
ungroup() %>%
select(-concat)
I hope this will help you.
CodePudding user response:
One approach is to use cumall
which is a cumulative "all" from dplyr
. This will make the new dummy variable 0 if all rows up to the current row are the same as the first
row for company_name
. If not, then will be set at 1. This assumes the dates are in chronological order, oldest first.
library(tidyverse)
df %>%
group_by(employee_name) %>%
mutate(outsider_dummy = !cumall(company_name == first(company_name)))
Output
date company_name employee_name role_name outsider_dummy
<chr> <chr> <chr> <chr> <int>
1 2013-01-01 A Zoe manager 0
2 2014-01-01 A Zoe senior_manager 0
3 2015-01-01 A Zoe senior_manager 0
4 2015-01-01 B Zoe director 1
5 2016-01-01 B Zoe director 1
6 2017-01-01 C Zoe director 1
7 2013-01-01 A Morgan manager 0
8 2014-01-01 A Morgan senior_manager 0
9 2015-01-01 A Morgan director 0
10 2016-01-01 A Morgan director 0
11 2017-01-01 A Morgan director 0
12 2013-01-01 A John manager 0
13 2014-01-01 B John senior_manager 1
14 2015-01-01 A John director 1
15 2016-01-01 A John director 1
16 2017-01-01 A John director 1