Converting periodic panel data to country/day time sequence using Stata/purr/pandas-CodePudding

id	country	start_date	end_date	var1	var2
xx1	xx	01/12/2020	25/12/2020	1	4
yy1	yy	03/12/2020	25/12/2020	2	3
xx2	xx	27/12/2020	10/03/2022	4	2

I need to convert a periodic dataset (N 400) formatted like this into country/day while maintaining the var1 and var2 observations across each day within the periods.

I need a new approach since using loops and tsfill in Stata to replace dates doesn't work, i.e. the code runs but the data is still missing information in all rows between periods since neither id nor country is present in the imputed dates. Similarly, I've tried the following method in purr package in R:

res1 <- melt(setNames(lapply(1:nrow(df), function(x) seq(df[x, "start_date"],
                                                           df[x, "end_date"], by = "1 day")), df$country))

In the end, it should look like this:

id	country	date	var1	var2
xx1	xx	01/12/2020	1	4
xx1	xx	02/12/2020	1	4
xx1	xx	03/12/2020	1	4
yy1	yy	03/12/2020	2	3

....

id	country	date	var1	var2
xx1	xx	25/12/2020	1	4
yy1	yy	25/12/2020	2	3
xx1	xx	26/12/2020	.	.
yy1	yy	26/12/2020	.	.
xx2	xx	27/12/2020	4	2
yy1	yy	27/12/2020	.	:

I'm not familiar enough with pandas, however I was told that this may be the best way to resolve this issue.

CodePudding user response：

There is no such thing as the way or the best way to do this. It should be quite easy in any environment for data management worth discussion.

For a Stata solution, note that the example is ambiguous over whether your date variables are string or already properly formatted numeric date variables. This assumes the first, but the calls to daily() should be omitted otherwise.

* Example generated by -dataex-. For more info, type help dataex
clear
input str3 id str2 country str10(start_date end_date) byte(var1 var2)
"xx1" "xx" "01/12/2020" "25/12/2020" 1 4
"yy1" "yy" "03/12/2020" "25/12/2020" 2 3
"xx2" "xx" "27/12/2020" "10/03/2022" 4 2
end

gen s_date = daily(start_date, "DMY")
gen e_date = daily(end_date, "DMY")
gen duration = e_date - s_date   1 
expand duration 
bysort id country: gen date = s_date   _n - 1 
format %tdDD/NN/CCYY date 

list id country start end date var1 var2 in 1/10

      -------------------------------------------------------------------- 
     |  id   country   start_date     end_date         date   var1   var2 |
     |--------------------------------------------------------------------|
  1. | xx1        xx   01/12/2020   25/12/2020   01/12/2020      1      4 |
  2. | xx1        xx   01/12/2020   25/12/2020   02/12/2020      1      4 |
  3. | xx1        xx   01/12/2020   25/12/2020   03/12/2020      1      4 |
  4. | xx1        xx   01/12/2020   25/12/2020   04/12/2020      1      4 |
  5. | xx1        xx   01/12/2020   25/12/2020   05/12/2020      1      4 |
     |--------------------------------------------------------------------|
  6. | xx1        xx   01/12/2020   25/12/2020   06/12/2020      1      4 |
  7. | xx1        xx   01/12/2020   25/12/2020   07/12/2020      1      4 |
  8. | xx1        xx   01/12/2020   25/12/2020   08/12/2020      1      4 |
  9. | xx1        xx   01/12/2020   25/12/2020   09/12/2020      1      4 |
 10. | xx1        xx   01/12/2020   25/12/2020   10/12/2020      1      4 |
      --------------------------------------------------------------------

CodePudding user response：

An R approach using dplyr::group_split to split by id and tidyr::complete to add obs. for all days may look like so:

library(dplyr)
library(tidyr)
library(purrr)

dat %>%
  mutate(across(c(start_date, end_date), as.Date, format = "%d/%m/%Y")) %>%
  dplyr::group_split(id) %>%
  purrr::map_dfr(function(x) { 
    mutate(x, date = NA) %>% 
           tidyr::complete(date = seq.Date(start_date, end_date, by = "day"), id = id, country = country, var1 = var1, var2 = var2) %>%
      select(-start_date, -end_date)
    })
#> # A tibble: 490 × 5
#>    date       id    country  var1  var2
#>    <date>     <chr> <chr>   <int> <int>
#>  1 2020-12-01 xx1   xx          1     4
#>  2 2020-12-02 xx1   xx          1     4
#>  3 2020-12-03 xx1   xx          1     4
#>  4 2020-12-04 xx1   xx          1     4
#>  5 2020-12-05 xx1   xx          1     4
#>  6 2020-12-06 xx1   xx          1     4
#>  7 2020-12-07 xx1   xx          1     4
#>  8 2020-12-08 xx1   xx          1     4
#>  9 2020-12-09 xx1   xx          1     4
#> 10 2020-12-10 xx1   xx          1     4
#> # … with 480 more rows

DATA

dat <- data.frame(
  stringsAsFactors = FALSE,
                id = c("xx1", "yy1", "xx2"),
           country = c("xx", "yy", "xx"),
        start_date = c("01/12/2020", "03/12/2020", "27/12/2020"),
          end_date = c("25/12/2020", "25/12/2020", "10/03/2022"),
              var1 = c(1L, 2L, 4L),
              var2 = c(4L, 3L, 2L)
)