converting transitions in panel structures into string sequences-CodePudding

I have a panel data that records the employment status of individuals across different years. Many of them change jobs over the time span of my data. I want to capture these transitions and merge them into string sequences. For example:

Year Person Employment_Status
1990 Bob    High School Teacher 
1991 Bob    High School Teacher 
1992 Bob    Freelancer
1993 Bob    High School Teacher 
1990 Peter  Singer
1991 Peter  Singer
1990 James  Actor
1991 James  Actor
1992 James  Producer
1993 James  Producer
1994 James  Investor

The ideal output should look like below:

Person Job_Sequence
Bob    High School Teacher-Freelancer-High School Teacher 
Peter  Singer
James  Actor-Producer-Investor

Essentially, each person is reduced to one row of record. The challenge for me is that different people have different number of transitions (ranging from zero to a dozen).

CodePudding user response：

We may apply rleid on 'Employment_Status' to group adjacent elements that are same as a single group, get the distinct elements of 'Person', 'grp', and do a group by paste

library(dplyr)
library(data.table)
df1 %>%
   mutate(grp = rleid(Employment_Status)) %>%
   distinct(Person, grp, .keep_all = TRUE) %>%
   group_by(Person) %>%
   summarise(Job_Sequence = str_c(Employment_Status,
     collapse = '-'), .groups = 'drop')

-output

# A tibble: 3 × 2
  Person Job_Sequence                                      
  <chr>  <chr>                                             
1 Bob    High School Teacher-Freelancer-High School Teacher
2 James  Actor-Producer-Investor                           
3 Peter  Singer

Or using base R

aggregate(cbind(Job_Sequence = Employment_Status) ~ Person, 
  subset(df1, !duplicated(with(rle(Employment_Status), 
   rep(seq_along(values), lengths)))), FUN = paste, collapse = '-')

-output

 Person                                       Job_Sequence
1    Bob High School Teacher-Freelancer-High School Teacher
2  James                            Actor-Producer-Investor
3  Peter                                             Singer

data

df1 <- structure(list(Year = c(1990L, 1991L, 1992L, 1993L, 1990L, 1991L, 
1990L, 1991L, 1992L, 1993L, 1994L), Person = c("Bob", "Bob", 
"Bob", "Bob", "Peter", "Peter", "James", "James", "James", "James", 
"James"), Employment_Status = c("High School Teacher", "High School Teacher", 
"Freelancer", "High School Teacher", "Singer", "Singer", "Actor", 
"Actor", "Producer", "Producer", "Investor")), 
class = "data.frame", row.names = c(NA, 
-11L))