I regularly have the problem that I have data describing the exact change of a state (in this case a project phase) and I want to historise this data.
That means I want to be able to describe at some point in the past in which phase which project was.
Lets use this test data:
library(data.table)
library(lubridate)
df = structure(list(Project = c("A", "A", "B", "B", "B", "B", "C",
"C", "C", "D", "D"), Date = structure(c(18659, 18748, 18687,
18718, 18748, 18871, 18718, 18718, 18779, 18659, 18840), class = "Date"),
Phase = c("Init", "P2O", "Init", "P2O", "Build", "Doc", "Init",
"P2O", "Build", "Init", "P2O")), row.names = c(NA, -11L), class = c("data.table",
"data.frame"))
df[, Date:=ceiling_date(as.Date(Date, format="%d.%m.%Y"), "month")]
They look like:
Project Date Phase
Project Date Phase
1: A 2021-03-01 Init
2: A 2021-06-01 P2O
3: B 2021-04-01 Init
4: B 2021-05-01 P2O
5: B 2021-06-01 Build
6: B 2021-10-01 Doc
7: C 2021-05-01 Init
8: C 2021-05-01 P2O
9: C 2021-07-01 Build
10: D 2021-03-01 Init
11: D 2021-09-01 P2O
Now I want to create a table which has for each Phase a Col and for each Date a Row.
I tried:
> dcast(df, Date~Phase, fun=length)[order(Date)]
Date Build Doc Init P2O
1: 2021-03-01 0 0 2 0
2: 2021-04-01 0 0 1 0
3: 2021-05-01 0 0 1 2
4: 2021-06-01 1 0 0 1
5: 2021-07-01 1 0 0 0
6: 2021-09-01 0 0 0 1
7: 2021-10-01 0 1 0 0
But here is a big mistake. Let's take a look just at project A.
Project Date Phase
1: A 2021-03-01 Init
2: A 2021-06-01 P2O
Project A got into the "Init" phase on 01.03.2021 and "P2O" on 01.06.2021. In the current logic, this would look like :
Date Build Doc Init P2O
1: 2021-03-01 0 0 1 0
4: 2021-06-01 0 0 0 1
But that is wrong. Between March and June this project was still in Init, so correctly this should be like:
Date Build Doc Init P2O
1: 2021-03-01 0 0 1 0
2: 2021-04-01 0 0 1 0
3: 2021-05-01 0 0 1 0
4: 2021-06-01 0 0 0 1
Has anybody an idea how I could solve this problem?
CodePudding user response:
Perhaps this?
dfwide <- dcast(df, Project Date ~ Phase, fun = length)
merge(
dfwide,
dfwide[, .(Date = seq(min(Date), max(Date), by = "month")), by = .(Project)],
by = c("Project", "Date"), all = TRUE
)[order(Date),
][, setNames(nafill(.SD, type = "locf"), names(.SD)), by = .(Project)]
# Project Date Build Doc Init P2O
# <char> <Date> <int> <int> <int> <int>
# 1: A 2021-03-01 0 0 1 0
# 2: A 2021-04-01 0 0 1 0
# 3: A 2021-05-01 0 0 1 0
# 4: A 2021-06-01 0 0 0 1
# 5: D 2021-03-01 0 0 1 0
# 6: D 2021-04-01 0 0 1 0
# 7: D 2021-05-01 0 0 1 0
# 8: D 2021-06-01 0 0 1 0
# 9: D 2021-07-01 0 0 1 0
# 10: D 2021-08-01 0 0 1 0
# 11: D 2021-09-01 0 0 0 1
# 12: B 2021-04-01 0 0 1 0
# 13: B 2021-05-01 0 0 0 1
# 14: B 2021-06-01 1 0 0 0
# 15: B 2021-07-01 1 0 0 0
# 16: B 2021-08-01 1 0 0 0
# 17: B 2021-09-01 1 0 0 0
# 18: B 2021-10-01 0 1 0 0
# 19: C 2021-05-01 0 0 1 1
# 20: C 2021-06-01 0 0 1 1
# 21: C 2021-07-01 1 0 0 0
# Project Date Build Doc Init P2O
Same without the literal merge
(though we're just using a data.table
-left-join instead)
dfwide[
dfwide[, .(Date = unique(c(seq(min(Date), max(Date), by = "month"), max(Date)))), by = .(Project)],
on = .(Project, Date)][order(Date),
][, setNames(nafill(.SD, type = "locf"), names(.SD)), by = .(Project)]
The use of unique(c(..., max(Date)))
is to ensure that the max date is always used/preserved; if any of seq(.)
falls short of the max date without landing on it, then the max date may not be included. When we used merge(.., all=TRUE)
this wasn't a problem, since it would be preserved anyway, but using a left-join, it might be omitted. (Not in this data, though. I'm just being defensive.)