id | country | start_date | end_date | var1 | var2 |
---|---|---|---|---|---|
xx1 | xx | 01/12/2020 | 25/12/2020 | 1 | 4 |
yy1 | yy | 03/12/2020 | 25/12/2020 | 2 | 3 |
xx2 | xx | 27/12/2020 | 10/03/2022 | 4 | 2 |
I need to convert a periodic dataset (N 400) formatted like this into country/day while maintaining the var1 and var2 observations across each day within the periods.
I need a new approach since using loops and tsfill
in Stata to replace dates doesn't work, i.e. the code runs but the data is still missing information in all rows between periods since neither id
nor country
is present in the imputed dates. Similarly, I've tried the following method in purr package in R:
res1 <- melt(setNames(lapply(1:nrow(df), function(x) seq(df[x, "start_date"],
df[x, "end_date"], by = "1 day")), df$country))
In the end, it should look like this:
id | country | date | var1 | var2 |
---|---|---|---|---|
xx1 | xx | 01/12/2020 | 1 | 4 |
xx1 | xx | 02/12/2020 | 1 | 4 |
xx1 | xx | 03/12/2020 | 1 | 4 |
yy1 | yy | 03/12/2020 | 2 | 3 |
....
id | country | date | var1 | var2 |
---|---|---|---|---|
xx1 | xx | 25/12/2020 | 1 | 4 |
yy1 | yy | 25/12/2020 | 2 | 3 |
xx1 | xx | 26/12/2020 | . | . |
yy1 | yy | 26/12/2020 | . | . |
xx2 | xx | 27/12/2020 | 4 | 2 |
yy1 | yy | 27/12/2020 | . | : |
I'm not familiar enough with pandas, however I was told that this may be the best way to resolve this issue.
CodePudding user response:
There is no such thing as the way or the best way to do this. It should be quite easy in any environment for data management worth discussion.
For a Stata solution, note that the example is ambiguous over whether your date variables are string or already properly formatted numeric date variables. This assumes the first, but the calls to daily()
should be omitted otherwise.
* Example generated by -dataex-. For more info, type help dataex
clear
input str3 id str2 country str10(start_date end_date) byte(var1 var2)
"xx1" "xx" "01/12/2020" "25/12/2020" 1 4
"yy1" "yy" "03/12/2020" "25/12/2020" 2 3
"xx2" "xx" "27/12/2020" "10/03/2022" 4 2
end
gen s_date = daily(start_date, "DMY")
gen e_date = daily(end_date, "DMY")
gen duration = e_date - s_date 1
expand duration
bysort id country: gen date = s_date _n - 1
format %tdDD/NN/CCYY date
list id country start end date var1 var2 in 1/10
--------------------------------------------------------------------
| id country start_date end_date date var1 var2 |
|--------------------------------------------------------------------|
1. | xx1 xx 01/12/2020 25/12/2020 01/12/2020 1 4 |
2. | xx1 xx 01/12/2020 25/12/2020 02/12/2020 1 4 |
3. | xx1 xx 01/12/2020 25/12/2020 03/12/2020 1 4 |
4. | xx1 xx 01/12/2020 25/12/2020 04/12/2020 1 4 |
5. | xx1 xx 01/12/2020 25/12/2020 05/12/2020 1 4 |
|--------------------------------------------------------------------|
6. | xx1 xx 01/12/2020 25/12/2020 06/12/2020 1 4 |
7. | xx1 xx 01/12/2020 25/12/2020 07/12/2020 1 4 |
8. | xx1 xx 01/12/2020 25/12/2020 08/12/2020 1 4 |
9. | xx1 xx 01/12/2020 25/12/2020 09/12/2020 1 4 |
10. | xx1 xx 01/12/2020 25/12/2020 10/12/2020 1 4 |
--------------------------------------------------------------------
CodePudding user response:
An R approach using dplyr::group_split
to split by id
and tidyr::complete
to add obs. for all days may look like so:
library(dplyr)
library(tidyr)
library(purrr)
dat %>%
mutate(across(c(start_date, end_date), as.Date, format = "%d/%m/%Y")) %>%
dplyr::group_split(id) %>%
purrr::map_dfr(function(x) {
mutate(x, date = NA) %>%
tidyr::complete(date = seq.Date(start_date, end_date, by = "day"), id = id, country = country, var1 = var1, var2 = var2) %>%
select(-start_date, -end_date)
})
#> # A tibble: 490 × 5
#> date id country var1 var2
#> <date> <chr> <chr> <int> <int>
#> 1 2020-12-01 xx1 xx 1 4
#> 2 2020-12-02 xx1 xx 1 4
#> 3 2020-12-03 xx1 xx 1 4
#> 4 2020-12-04 xx1 xx 1 4
#> 5 2020-12-05 xx1 xx 1 4
#> 6 2020-12-06 xx1 xx 1 4
#> 7 2020-12-07 xx1 xx 1 4
#> 8 2020-12-08 xx1 xx 1 4
#> 9 2020-12-09 xx1 xx 1 4
#> 10 2020-12-10 xx1 xx 1 4
#> # … with 480 more rows
DATA
dat <- data.frame(
stringsAsFactors = FALSE,
id = c("xx1", "yy1", "xx2"),
country = c("xx", "yy", "xx"),
start_date = c("01/12/2020", "03/12/2020", "27/12/2020"),
end_date = c("25/12/2020", "25/12/2020", "10/03/2022"),
var1 = c(1L, 2L, 4L),
var2 = c(4L, 3L, 2L)
)