I'm quite new to R so maybe this seems simple but I can't figure it out. My data looks like Df and it needs to look like Df2:
Df <- data.frame(country = c("A", "A", "A", "A", "A", "B","B", "B", "B"),
year = c("1950", "1951", "1952", "1953", "1954", "1950", "1951", "1952", "1953"),
start_year = c("NA", "1951", "1951", "NA", "1954", "1950", "NA", "1951", "1951"),
end_year= c("NA", "NA", "1952", "NA", "1954", "1950", "NA", "NA", "NA"),
status = c(0, 1, 1, 0, 1, 1, 0, 1, 1),
treatment = c(10, "NA", 20, 5, "NA", "NA", 30, 100, 10))
Df2 <- data.frame(country = c("A", "A", "A", "A", "B","B", "B"),
time1 = c("1950", "1951", "1953", "1954", "1950", "1951", "1952"),
time2 = c("1951", "1953", "1954", "1955", "1951", "1952", "1954"),
status = c(0, 1, 0, 1, 1, 0, 1),
treatment = c(10, 20, 0, "NA", "NA", 30, 110))
The goal is to have it in a structure to do a PWP recurrent event analysis. Treatment in Df2 should be the sum of the treatment values of the intervals time1 to time 2.
Any ideas how I can get there? Thank you!
CodePudding user response:
You could use
library(dplyr)
Df %>%
mutate(across(where(is.character), ~na_if(.x, "NA")),
time1 = as.numeric(coalesce(start_year, year)),
treatment = as.numeric(treatment)) %>%
group_by(country, time1, status) %>%
summarise(treatment = sum(treatment, na.rm = TRUE), .groups = "drop") %>%
group_by(country) %>%
mutate(time2 = lead(time1, default = last(time1) 1)) %>%
select(country, time1, time2, status, treatment) %>%
ungroup()
to get
# A tibble: 7 x 5
country time1 time2 status treatment
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1950 1951 0 10
2 A 1951 1953 1 20
3 A 1953 1954 0 5
4 A 1954 1955 1 0
5 B 1950 1951 1 0
6 B 1951 1951 0 30
7 B 1951 1952 1 110
That's not exactly your desired output (see my comment), but a start to solve your problem.
CodePudding user response:
Df2 <- Df %>% mutate(episode = data.table::rleid(status))
library(tidyverse)
Df2 <- Df2 %>%
arrange(country, year) %>%
group_by(country, episode) %>%
mutate(time1 = min(year))
Df2 <- Df2 %>%
arrange(country, year) %>%
group_by(country, episode) %>%
mutate(time2 = (max(as.numeric(year) 1)))
I have created an episode identifier and managed to identify time1
and time2
for each episode
. Now, I still need to combine the rows grouped by episode
so that there is one row per episode that shows the sum of treatment
. Any ideas how to do that?