Home > other >  Data expansion based on information in year columns by group
Data expansion based on information in year columns by group

Time:12-09

I have a data looks like below

name  year_1  year_2  location  job  
Joe   1990    1991    Boston    Engineer
Joe   1991    1992    Boston    Engineer
Joe   1992    1996    New York  Data Scientist
Joe   1996    1998    Chicago   Data Scientist
Peter 2003    2004    Montreal  HR
Peter 2004    2006    Singapore Marketing

There are some gaps in terms of year transitions. I want to fill the gap by adding rows to each person's record, which should look like something in below"

name  year_1  year_2  location  job  
Joe   1990    1991    Boston    Engineer
Joe   1991    1992    Boston    Engineer
Joe   1992    1993    New York  Data Scientist
Joe   1993    1994    New York  Data Scientist
Joe   1994    1995    New York  Data Scientist
Joe   1995    1996    New York  Data Scientist
Joe   1996    1997    Chicago   Data Scientist
Joe   1997    1998    Chicago   Data Scientist
Peter 2003    2004    Montreal  HR
Peter 2004    2005    Singapore Marketing
Peter 2005    2006    Singapore Marketing

CodePudding user response:

We could first group, then generate the relevant sequences, and unnest them:

library(dplyr)
library(tidyr)

df |> 
  group_by(name, location, job, year_1) |>
  mutate(new_year_1 = if_else((year_2 - year_1) == 1, list(year_1), list(seq(from = year_1, to = year_2 - 1, by = 1))),
         new_year_2 = if_else((year_2 - year_1) == 1, list(year_2), list(seq(from = year_1   1, to = year_2, by = 1)))) |>
  unnest_longer(new_year_1:new_year_2) |>
  ungroup() |>
  select(name,
         year_1 = new_year_1,
         year_2 = new_year_2,
         location,
         job) 

Output:

# A tibble: 11 × 5
   name  year_1 year_2 location  job          
   <chr>  <dbl>  <dbl> <chr>     <chr>        
 1 Joe     1990   1991 Boston    Engineer     
 2 Joe     1991   1992 Boston    Engineer     
 3 Joe     1992   1993 NewYork   DataScientist
 4 Joe     1993   1994 NewYork   DataScientist
 5 Joe     1994   1995 NewYork   DataScientist
 6 Joe     1995   1996 NewYork   DataScientist
 7 Joe     1996   1997 Chicago   DataScientist
 8 Joe     1997   1998 Chicago   DataScientist
 9 Peter   2003   2004 Montreal  HR           
10 Peter   2004   2005 Singapore Marketing    
11 Peter   2005   2006 Singapore Marketing    

Data:

library(readr)

df <- read_table("name  year_1  year_2  location  job
Joe   1990    1991    Boston    Engineer
Joe   1991    1992    Boston    Engineer
Joe   1992    1996    NewYork  DataScientist
Joe   1996    1998    Chicago   DataScientist
Peter 2003    2004    Montreal  HR
Peter 2004    2006    Singapore Marketing")

CodePudding user response:

You can uncount() based on the difference between year2 and year1 then adjust these variables:

library(dplyr)
library(tidyr)

dat %>%
  uncount(year_2 - year_1) %>%
  group_by(name, yr = year_1) %>%
  mutate(year_1 = first(year_1)   row_number() - 1,
         year_2 = year_1   1) %>%
  ungroup() %>%
  select(-yr)

# A tibble: 11 × 5
   name  year_1 year_2 location  job           
   <chr>  <int>  <dbl> <chr>     <chr>         
 1 Joe     1990   1991 Boston    Engineer      
 2 Joe     1991   1992 Boston    Engineer      
 3 Joe     1992   1993 New York  Data Scientist
 4 Joe     1993   1994 New York  Data Scientist
 5 Joe     1994   1995 New York  Data Scientist
 6 Joe     1995   1996 New York  Data Scientist
 7 Joe     1996   1997 Chicago   Data Scientist
 8 Joe     1997   1998 Chicago   Data Scientist
 9 Peter   2003   2004 Montreal  HR            
10 Peter   2004   2005 Singapore Marketing     
11 Peter   2005   2006 Singapore Marketing  
  • Related