Home > Back-end >  Converting periodic panel data to country/day time sequence using Stata/purr/pandas
Converting periodic panel data to country/day time sequence using Stata/purr/pandas

Time:09-26

id country start_date end_date var1 var2
xx1 xx 01/12/2020 25/12/2020 1 4
yy1 yy 03/12/2020 25/12/2020 2 3
xx2 xx 27/12/2020 10/03/2022 4 2

I need to convert a periodic dataset (N 400) formatted like this into country/day while maintaining the var1 and var2 observations across each day within the periods.

I need a new approach since using loops and tsfill in Stata to replace dates doesn't work, i.e. the code runs but the data is still missing information in all rows between periods since neither id nor country is present in the imputed dates. Similarly, I've tried the following method in purr package in R:

res1 <- melt(setNames(lapply(1:nrow(df), function(x) seq(df[x, "start_date"],
                                                           df[x, "end_date"], by = "1 day")), df$country))

In the end, it should look like this:

id country date var1 var2
xx1 xx 01/12/2020 1 4
xx1 xx 02/12/2020 1 4
xx1 xx 03/12/2020 1 4
yy1 yy 03/12/2020 2 3

....

id country date var1 var2
xx1 xx 25/12/2020 1 4
yy1 yy 25/12/2020 2 3
xx1 xx 26/12/2020 . .
yy1 yy 26/12/2020 . .
xx2 xx 27/12/2020 4 2
yy1 yy 27/12/2020 . :

I'm not familiar enough with pandas, however I was told that this may be the best way to resolve this issue.

CodePudding user response:

There is no such thing as the way or the best way to do this. It should be quite easy in any environment for data management worth discussion.

For a Stata solution, note that the example is ambiguous over whether your date variables are string or already properly formatted numeric date variables. This assumes the first, but the calls to daily() should be omitted otherwise.

* Example generated by -dataex-. For more info, type help dataex
clear
input str3 id str2 country str10(start_date end_date) byte(var1 var2)
"xx1" "xx" "01/12/2020" "25/12/2020" 1 4
"yy1" "yy" "03/12/2020" "25/12/2020" 2 3
"xx2" "xx" "27/12/2020" "10/03/2022" 4 2
end

gen s_date = daily(start_date, "DMY")
gen e_date = daily(end_date, "DMY")
gen duration = e_date - s_date   1 
expand duration 
bysort id country: gen date = s_date   _n - 1 
format %tdDD/NN/CCYY date 

list id country start end date var1 var2 in 1/10

      -------------------------------------------------------------------- 
     |  id   country   start_date     end_date         date   var1   var2 |
     |--------------------------------------------------------------------|
  1. | xx1        xx   01/12/2020   25/12/2020   01/12/2020      1      4 |
  2. | xx1        xx   01/12/2020   25/12/2020   02/12/2020      1      4 |
  3. | xx1        xx   01/12/2020   25/12/2020   03/12/2020      1      4 |
  4. | xx1        xx   01/12/2020   25/12/2020   04/12/2020      1      4 |
  5. | xx1        xx   01/12/2020   25/12/2020   05/12/2020      1      4 |
     |--------------------------------------------------------------------|
  6. | xx1        xx   01/12/2020   25/12/2020   06/12/2020      1      4 |
  7. | xx1        xx   01/12/2020   25/12/2020   07/12/2020      1      4 |
  8. | xx1        xx   01/12/2020   25/12/2020   08/12/2020      1      4 |
  9. | xx1        xx   01/12/2020   25/12/2020   09/12/2020      1      4 |
 10. | xx1        xx   01/12/2020   25/12/2020   10/12/2020      1      4 |
      -------------------------------------------------------------------- 

CodePudding user response:

An R approach using dplyr::group_split to split by id and tidyr::complete to add obs. for all days may look like so:

library(dplyr)
library(tidyr)
library(purrr)

dat %>%
  mutate(across(c(start_date, end_date), as.Date, format = "%d/%m/%Y")) %>%
  dplyr::group_split(id) %>%
  purrr::map_dfr(function(x) { 
    mutate(x, date = NA) %>% 
           tidyr::complete(date = seq.Date(start_date, end_date, by = "day"), id = id, country = country, var1 = var1, var2 = var2) %>%
      select(-start_date, -end_date)
    })
#> # A tibble: 490 × 5
#>    date       id    country  var1  var2
#>    <date>     <chr> <chr>   <int> <int>
#>  1 2020-12-01 xx1   xx          1     4
#>  2 2020-12-02 xx1   xx          1     4
#>  3 2020-12-03 xx1   xx          1     4
#>  4 2020-12-04 xx1   xx          1     4
#>  5 2020-12-05 xx1   xx          1     4
#>  6 2020-12-06 xx1   xx          1     4
#>  7 2020-12-07 xx1   xx          1     4
#>  8 2020-12-08 xx1   xx          1     4
#>  9 2020-12-09 xx1   xx          1     4
#> 10 2020-12-10 xx1   xx          1     4
#> # … with 480 more rows

DATA

dat <- data.frame(
  stringsAsFactors = FALSE,
                id = c("xx1", "yy1", "xx2"),
           country = c("xx", "yy", "xx"),
        start_date = c("01/12/2020", "03/12/2020", "27/12/2020"),
          end_date = c("25/12/2020", "25/12/2020", "10/03/2022"),
              var1 = c(1L, 2L, 4L),
              var2 = c(4L, 3L, 2L)
)
  • Related