Find number of days between two dates with condition-CodePudding

I have a dataset PosNeg like this. I need to find total number of days between first consecutive positives for each ID. By consecutive P's, I mean all consecutive P's before 1st negative (N).

Eg: for ID 1, 1st P is on 2021-01-08, so days between 01/08 and 03/026. For ID 2 and 3, there is only 1 consecutive first P, so 0 days for both. For ID 4, 1st consecutive P starts from 2021-02-15 to 2021-03-18 so total days are 31.

I tried some code but it is valid for only one ID at a time and so will have to put all IDs manually. I have around 50,000 ID's in the real dataset. Any suggestions in either Python/R/SQL would be helpful.

ID	Test	Date
1	N	2021-01-02
1	P	2021-01-08
1	P	2021-02-25
1	P	2021-03-26
2	P	2021-02-05
2	N	2021-03-04
2	P	2021-03-30
3	N	2021-01-24
3	P	2021-02-10
4	P	2021-02-15
4	P	2021-02-28
4	P	2021-03-18
4	N	2021-04-11

Output:

ID	Days
1	77
2	0
3	0
4	31

CodePudding user response：

You can write a function that only impacts the first P from which we can filter using it:

fn <- function(x){
  r <- rle(x)
  is.na(r$values) <- which(r$values == 'P')[1]
  inverse.rle(r)
}

data %>%
  group_by(ID) %>%
  filter(is.na(fn(Test))) %>%
  summarise(days = sum(diff(as.Date(Date))))

# A tibble: 4 x 2
     ID days   
  <int> <drtn> 
1     1 77 days
2     2  0 days
3     3  0 days
4     4 31 days

CodePudding user response：

In case you want a Python alternative using Pandas:

import pandas as pd 

# ... load as DataFrame...

df['Date'] = pd.to_datetime(df['Date'])

df['consec'] = df['Test'].ne(df['Test'].shift()).cumsum().where(df['Test'].eq('P'))
groups = df.dropna(subset='consec').groupby(['ID', 'consec'])
result = (groups.tail(1).set_index('ID')['Date'] - groups.head(1).set_index('ID')['Date']).reset_index().drop_duplicates(subset='ID')

print(result.rename(columns={'Date':'Days'}))

   ID    Days
0   1 77 days
1   2  0 days
3   3  0 days
4   4 31 days

CodePudding user response：

library(tidyverse)

data <- tribble(
  ~ID, ~Test, ~Date,
  1L, "N", "2021-01-02",
  1L, "P", "2021-01-08",
  1L, "P", "2021-02-25",
  1L, "P", "2021-03-26",
  2L, "P", "2021-02-05",
  2L, "N", "2021-03-04",
  2L, "P", "2021-03-30",
  3L, "N", "2021-01-24",
  3L, "P", "2021-02-10",
  4L, "P", "2021-02-15",
  4L, "P", "2021-02-28",
  4L, "P", "2021-03-18",
  4L, "N", "2021-04-11"
)

data %>%
  type_convert() %>%
  group_by(ID) %>%
  filter(Test == "P") %>%
  arrange(Date) %>%
  slice(1:3) %>%
  mutate(step = row_number()) %>%
  pivot_wider(names_from = step, values_from = Date) %>%
  summarise(Days = (`3` - `1`) %>% replace_na(as.difftime(0, units = "days")))
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   Test = col_character(),
#>   Date = col_date(format = "")
#> )
#> # A tibble: 4 × 2
#>      ID Days   
#>   <int> <drtn> 
#> 1     1 77 days
#> 2     2  0 days
#> 3     3  0 days
#> 4     4 31 days

^{Created on 2022-05-19 by the reprex package (v2.0.0)}