I am not sure if this task is of self join
or not. I am basically trying to lookup the latest date for each State.UnionTerritory
in below dataframe where the Daily_confirmed
cases for each of them were <= half of current Date.
This will help me to get the the doubling time of cases for each State on each date.
library(tidyverse)
library(lubridate)
df_ind <- read_csv("https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/sample_data.csv")
df_ind %>% head()
# output
Date State.UnionTerritory Daily_confirmed
<date> <chr> <dbl>
1 2021-12-23 Haryana 46
2 2021-12-23 Maharashtra 1179
3 2021-12-23 Delhi 118
4 2021-12-22 Haryana 55
5 2021-12-22 Maharashtra 1201
6 2021-12-22 Delhi 125
For example Delhi
has 118
Cases on 2021-12-23
and less than or half of this for Delhi is coming as 57
on 2021-12-15
so doubling rate would be 2021-12-23
- 2021-12-15
= 8 days.
so I should get something like:
This should be applied for each State in the data & on all dates.
df_ind <- df_ind %>%
mutate(Daily_confirmed_half = as.integer(Daily_confirmed / 2) )
I am not sure how exactly I can try this to get proper Dates as the Doubling_Date
column where Daily_confirmed
Cases values met the condition of <= half
.
I can group summarize
& use first
to pull the latest dates but not sure what would be the efficient way of bringing the right dates in another column of this data frame.
CodePudding user response:
I'm not sure if this is your desired output. The approach uses a full_join
together with filter
to simulate an non-euqal join in dplyr. Then we do some data cleaning and in the last step we need a left_join
to our original data, since there are quite a couple of days, where we cannot calculate the doubling date, since its not included in the time series.
library(tidyverse)
library(lubridate)
df_ind %>%
group_by(State.UnionTerritory) %>%
full_join(., ., by = c("State.UnionTerritory")) %>%
filter(Date.x > Date.y,
Daily_confirmed.x > (Daily_confirmed.y * 2)) %>%
group_by(Date.x, State.UnionTerritory) %>%
filter(Date.y == max(Date.y)) %>%
filter(Daily_confirmed.y == max(Daily_confirmed.y)) %>%
rename("Date" = Date.x,
"Daily_confirmed" = Daily_confirmed.x,
"Doubling_Date" = Date.y) %>%
select(- Daily_confirmed.y) %>%
mutate(Day_to_double = Date - Doubling_Date) %>%
left_join(df_ind,
.,
by = c("Date", "State.UnionTerritory", "Daily_confirmed")) %>%
arrange(State.UnionTerritory, desc(Date))
#> # A tibble: 252 x 5
#> Date State.UnionTerritory Daily_confirmed Doubling_Date Day_to_double
#> <date> <chr> <dbl> <date> <drtn>
#> 1 2021-12-23 Delhi 118 2021-12-15 8 days
#> 2 2021-12-22 Delhi 125 2021-12-15 7 days
#> 3 2021-12-21 Delhi 102 2021-12-14 7 days
#> 4 2021-12-20 Delhi 91 2021-12-14 6 days
#> 5 2021-12-19 Delhi 107 2021-12-14 5 days
#> 6 2021-12-18 Delhi 86 2021-12-13 5 days
#> 7 2021-12-17 Delhi 69 2021-12-13 4 days
#> 8 2021-12-16 Delhi 85 2021-12-13 3 days
#> 9 2021-12-15 Delhi 57 2021-11-27 18 days
#> 10 2021-12-14 Delhi 45 2021-11-15 29 days
#> # … with 242 more rows
Created on 2021-12-25 by the reprex package (v0.3.0)