Coworkers | Date |
---|---|
A | 2011-01-01 |
D | 2011-01-02 |
B;;D | 2011-01-03 |
E;;F | 2011-01-04 |
D | 2012-11-05 |
D;;G | 2012-11-06 |
A | 2012-11-09 |
Hello, I am trying to create a dummy variable based on text analysis (e.g., grepl
).
The unit of analysis is a project, and the two main variables are coworkers (text) and date.
I am curious if there is any way that I can create a dummy variable that indicates all projects done by a worker who collaborated with "B" (in this example, D) as 1. One more KEY condition I would like to add is... I would like to give the value of 1 only to projects that occurred AFTER B and D worked together. I mean, in this case, I want to mark the project in the second row, which is done by D, as 0 because it occurred before B and D met.
Can I create this type of variable using R commands? As I have millions of observations, I would not be able to do it manually. Alphabets in the table are texts.
Thank you!
PS. In the "Coworkers" column, coworkers are separated by ;;
CodePudding user response:
Here is one possible solution with tidyverse
(at least I think this is what you are looking for). First, I create a new column (i.e., flag
) that indicates whether B
has co-worked with D
, and if so, then I assign a 1. Next, I use cummax
which will everything 1 after the first row of B;;D
. This creates two groups for the before and after. Next, I use case_when
to change the first occurrence to 0, as you specified. Then, for any row with D
, I change to 1 and all others are changed to 0.
library(tidyverse)
df %>%
mutate(flag = ifelse(str_detect(Coworkers, "B;;D") | str_detect(Coworkers, "D;;B"), 1, 0),
flag = cummax(flag == 1),
flag = case_when(flag != 0 & !duplicated(flag) ~ 0,
grepl("D", Coworkers) & flag == 1 ~ 1,
TRUE ~ 0))
Output
Coworkers Date flag
1 A 2011-01-01 0
2 D 2011-01-02 0
3 B;;D 2011-01-03 0
4 E;;F 2011-01-04 0
5 D 2012-11-05 1
6 D;;G 2012-11-06 1
7 A 2012-11-09 0
8 B 2012-12-09 0
9 C;;B 2012-12-09 0
Data
df <- structure(list(Coworkers = c("A", "D", "B;;D", "E;;F", "D", "D;;G",
"A", "B", "C;;B"), Date = c("2011-01-01", "2011-01-02", "2011-01-03",
"2011-01-04", "2012-11-05", "2012-11-06", "2012-11-09", "2012-12-09",
"2012-12-09")), class = "data.frame", row.names = c(NA, -9L))