I have a data looks like below
name year_1 year_2 location job
Joe 1990 1991 Boston Engineer
Joe 1991 1992 Boston Engineer
Joe 1992 1996 New York Data Scientist
Joe 1996 1998 Chicago Data Scientist
Peter 2003 2004 Montreal HR
Peter 2004 2006 Singapore Marketing
There are some gaps in terms of year transitions. I want to fill the gap by adding rows to each person's record, which should look like something in below"
name year_1 year_2 location job
Joe 1990 1991 Boston Engineer
Joe 1991 1992 Boston Engineer
Joe 1992 1993 New York Data Scientist
Joe 1993 1994 New York Data Scientist
Joe 1994 1995 New York Data Scientist
Joe 1995 1996 New York Data Scientist
Joe 1996 1997 Chicago Data Scientist
Joe 1997 1998 Chicago Data Scientist
Peter 2003 2004 Montreal HR
Peter 2004 2005 Singapore Marketing
Peter 2005 2006 Singapore Marketing
CodePudding user response:
We could first group, then generate the relevant sequences, and unnest them:
library(dplyr)
library(tidyr)
df |>
group_by(name, location, job, year_1) |>
mutate(new_year_1 = if_else((year_2 - year_1) == 1, list(year_1), list(seq(from = year_1, to = year_2 - 1, by = 1))),
new_year_2 = if_else((year_2 - year_1) == 1, list(year_2), list(seq(from = year_1 1, to = year_2, by = 1)))) |>
unnest_longer(new_year_1:new_year_2) |>
ungroup() |>
select(name,
year_1 = new_year_1,
year_2 = new_year_2,
location,
job)
Output:
# A tibble: 11 × 5
name year_1 year_2 location job
<chr> <dbl> <dbl> <chr> <chr>
1 Joe 1990 1991 Boston Engineer
2 Joe 1991 1992 Boston Engineer
3 Joe 1992 1993 NewYork DataScientist
4 Joe 1993 1994 NewYork DataScientist
5 Joe 1994 1995 NewYork DataScientist
6 Joe 1995 1996 NewYork DataScientist
7 Joe 1996 1997 Chicago DataScientist
8 Joe 1997 1998 Chicago DataScientist
9 Peter 2003 2004 Montreal HR
10 Peter 2004 2005 Singapore Marketing
11 Peter 2005 2006 Singapore Marketing
Data:
library(readr)
df <- read_table("name year_1 year_2 location job
Joe 1990 1991 Boston Engineer
Joe 1991 1992 Boston Engineer
Joe 1992 1996 NewYork DataScientist
Joe 1996 1998 Chicago DataScientist
Peter 2003 2004 Montreal HR
Peter 2004 2006 Singapore Marketing")
CodePudding user response:
You can uncount()
based on the difference between year2 and year1 then adjust these variables:
library(dplyr)
library(tidyr)
dat %>%
uncount(year_2 - year_1) %>%
group_by(name, yr = year_1) %>%
mutate(year_1 = first(year_1) row_number() - 1,
year_2 = year_1 1) %>%
ungroup() %>%
select(-yr)
# A tibble: 11 × 5
name year_1 year_2 location job
<chr> <int> <dbl> <chr> <chr>
1 Joe 1990 1991 Boston Engineer
2 Joe 1991 1992 Boston Engineer
3 Joe 1992 1993 New York Data Scientist
4 Joe 1993 1994 New York Data Scientist
5 Joe 1994 1995 New York Data Scientist
6 Joe 1995 1996 New York Data Scientist
7 Joe 1996 1997 Chicago Data Scientist
8 Joe 1997 1998 Chicago Data Scientist
9 Peter 2003 2004 Montreal HR
10 Peter 2004 2005 Singapore Marketing
11 Peter 2005 2006 Singapore Marketing