How can I create a dummy variable based on text analysis and time sequence of events?-CodePudding

Coworkers	Date
A	2011-01-01
D	2011-01-02
B;;D	2011-01-03
E;;F	2011-01-04
D	2012-11-05
D;;G	2012-11-06
A	2012-11-09

Hello, I am trying to create a dummy variable based on text analysis (e.g., grepl). The unit of analysis is a project, and the two main variables are coworkers (text) and date.

I am curious if there is any way that I can create a dummy variable that indicates all projects done by a worker who collaborated with "B" (in this example, D) as 1. One more KEY condition I would like to add is... I would like to give the value of 1 only to projects that occurred AFTER B and D worked together. I mean, in this case, I want to mark the project in the second row, which is done by D, as 0 because it occurred before B and D met.

Can I create this type of variable using R commands? As I have millions of observations, I would not be able to do it manually. Alphabets in the table are texts.

Thank you!

PS. In the "Coworkers" column, coworkers are separated by ;;

CodePudding user response：

Here is one possible solution with tidyverse (at least I think this is what you are looking for). First, I create a new column (i.e., flag) that indicates whether B has co-worked with D, and if so, then I assign a 1. Next, I use cummax which will everything 1 after the first row of B;;D. This creates two groups for the before and after. Next, I use case_when to change the first occurrence to 0, as you specified. Then, for any row with D, I change to 1 and all others are changed to 0.

library(tidyverse)

df %>%
  mutate(flag = ifelse(str_detect(Coworkers, "B;;D") | str_detect(Coworkers, "D;;B"), 1, 0),
         flag = cummax(flag == 1),
         flag = case_when(flag != 0 & !duplicated(flag) ~ 0,
                          grepl("D", Coworkers) & flag == 1 ~ 1,
                          TRUE ~ 0))

Output

  Coworkers       Date flag
1         A 2011-01-01    0
2         D 2011-01-02    0
3      B;;D 2011-01-03    0
4      E;;F 2011-01-04    0
5         D 2012-11-05    1
6      D;;G 2012-11-06    1
7         A 2012-11-09    0
8         B 2012-12-09    0
9      C;;B 2012-12-09    0

Data

df <- structure(list(Coworkers = c("A", "D", "B;;D", "E;;F", "D", "D;;G", 
"A", "B", "C;;B"), Date = c("2011-01-01", "2011-01-02", "2011-01-03", 
"2011-01-04", "2012-11-05", "2012-11-06", "2012-11-09", "2012-12-09", 
"2012-12-09")), class = "data.frame", row.names = c(NA, -9L))