I have a dataset PosNeg like this. I need to find total number of days between first consecutive positives for each ID. By consecutive P's, I mean all consecutive P's before 1st negative (N).
Eg: for ID 1, 1st P is on 2021-01-08, so days between 01/08 and 03/026. For ID 2 and 3, there is only 1 consecutive first P, so 0 days for both. For ID 4, 1st consecutive P starts from 2021-02-15 to 2021-03-18 so total days are 31.
I tried some code but it is valid for only one ID at a time and so will have to put all IDs manually. I have around 50,000 ID's in the real dataset. Any suggestions in either Python/R/SQL would be helpful.
ID | Test | Date |
---|---|---|
1 | N | 2021-01-02 |
1 | P | 2021-01-08 |
1 | P | 2021-02-25 |
1 | P | 2021-03-26 |
2 | P | 2021-02-05 |
2 | N | 2021-03-04 |
2 | P | 2021-03-30 |
3 | N | 2021-01-24 |
3 | P | 2021-02-10 |
4 | P | 2021-02-15 |
4 | P | 2021-02-28 |
4 | P | 2021-03-18 |
4 | N | 2021-04-11 |
Output:
ID | Days |
---|---|
1 | 77 |
2 | 0 |
3 | 0 |
4 | 31 |
CodePudding user response:
You can write a function that only impacts the first P
from which we can filter using it:
fn <- function(x){
r <- rle(x)
is.na(r$values) <- which(r$values == 'P')[1]
inverse.rle(r)
}
data %>%
group_by(ID) %>%
filter(is.na(fn(Test))) %>%
summarise(days = sum(diff(as.Date(Date))))
# A tibble: 4 x 2
ID days
<int> <drtn>
1 1 77 days
2 2 0 days
3 3 0 days
4 4 31 days
CodePudding user response:
In case you want a Python alternative using Pandas:
import pandas as pd
# ... load as DataFrame...
df['Date'] = pd.to_datetime(df['Date'])
df['consec'] = df['Test'].ne(df['Test'].shift()).cumsum().where(df['Test'].eq('P'))
groups = df.dropna(subset='consec').groupby(['ID', 'consec'])
result = (groups.tail(1).set_index('ID')['Date'] - groups.head(1).set_index('ID')['Date']).reset_index().drop_duplicates(subset='ID')
print(result.rename(columns={'Date':'Days'}))
ID Days
0 1 77 days
1 2 0 days
3 3 0 days
4 4 31 days
CodePudding user response:
library(tidyverse)
data <- tribble(
~ID, ~Test, ~Date,
1L, "N", "2021-01-02",
1L, "P", "2021-01-08",
1L, "P", "2021-02-25",
1L, "P", "2021-03-26",
2L, "P", "2021-02-05",
2L, "N", "2021-03-04",
2L, "P", "2021-03-30",
3L, "N", "2021-01-24",
3L, "P", "2021-02-10",
4L, "P", "2021-02-15",
4L, "P", "2021-02-28",
4L, "P", "2021-03-18",
4L, "N", "2021-04-11"
)
data %>%
type_convert() %>%
group_by(ID) %>%
filter(Test == "P") %>%
arrange(Date) %>%
slice(1:3) %>%
mutate(step = row_number()) %>%
pivot_wider(names_from = step, values_from = Date) %>%
summarise(Days = (`3` - `1`) %>% replace_na(as.difftime(0, units = "days")))
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> Test = col_character(),
#> Date = col_date(format = "")
#> )
#> # A tibble: 4 × 2
#> ID Days
#> <int> <drtn>
#> 1 1 77 days
#> 2 2 0 days
#> 3 3 0 days
#> 4 4 31 days
Created on 2022-05-19 by the reprex package (v2.0.0)