I have a panel data that records the employment status of individuals across different years. Many of them change jobs over the time span of my data. I want to capture these transitions and merge them into string sequences. For example:
Year Person Employment_Status
1990 Bob High School Teacher
1991 Bob High School Teacher
1992 Bob Freelancer
1993 Bob High School Teacher
1990 Peter Singer
1991 Peter Singer
1990 James Actor
1991 James Actor
1992 James Producer
1993 James Producer
1994 James Investor
The ideal output should look like below:
Person Job_Sequence
Bob High School Teacher-Freelancer-High School Teacher
Peter Singer
James Actor-Producer-Investor
Essentially, each person is reduced to one row of record. The challenge for me is that different people have different number of transitions (ranging from zero to a dozen).
CodePudding user response:
We may apply rleid
on 'Employment_Status' to group adjacent elements that are same as a single group, get the distinct
elements of 'Person', 'grp', and do a group by paste
library(dplyr)
library(data.table)
df1 %>%
mutate(grp = rleid(Employment_Status)) %>%
distinct(Person, grp, .keep_all = TRUE) %>%
group_by(Person) %>%
summarise(Job_Sequence = str_c(Employment_Status,
collapse = '-'), .groups = 'drop')
-output
# A tibble: 3 × 2
Person Job_Sequence
<chr> <chr>
1 Bob High School Teacher-Freelancer-High School Teacher
2 James Actor-Producer-Investor
3 Peter Singer
Or using base R
aggregate(cbind(Job_Sequence = Employment_Status) ~ Person,
subset(df1, !duplicated(with(rle(Employment_Status),
rep(seq_along(values), lengths)))), FUN = paste, collapse = '-')
-output
Person Job_Sequence
1 Bob High School Teacher-Freelancer-High School Teacher
2 James Actor-Producer-Investor
3 Peter Singer
data
df1 <- structure(list(Year = c(1990L, 1991L, 1992L, 1993L, 1990L, 1991L,
1990L, 1991L, 1992L, 1993L, 1994L), Person = c("Bob", "Bob",
"Bob", "Bob", "Peter", "Peter", "James", "James", "James", "James",
"James"), Employment_Status = c("High School Teacher", "High School Teacher",
"Freelancer", "High School Teacher", "Singer", "Singer", "Actor",
"Actor", "Producer", "Producer", "Investor")),
class = "data.frame", row.names = c(NA,
-11L))