Home > Software design >  Find total IDs between two dates that satisfies a condition
Find total IDs between two dates that satisfies a condition

Time:05-19

I have a dataset PosNeg like this. I need to find count of ID's who have a pattern like this- P N P P or N P N N P N - that is having at least one N (negative) between two P's(positive). If this pattern occurs at least once, then count that ID. Date is always in ascending order.

Eg: for ID 1, I have at least 1 N on 02/25 between two P's, Hence I'll count ID 1. ID 2 and 3 does not have N between 2 P's, so those are not counted. ID 4 also has one N between two P's on 03/18, hence I'll include 4. So the total number of ID's which satisfices the condition is 2(1 and 4)

My idea is to find min(date) of positive and max(date) of positive for each ID, and look for any negative in between these dates, but not sure how to implement it. Any suggestions in R/Python/SQL would be helpful.

ID Test Date
1 P 2021-01-02
1 P 2021-01-08
1 N 2021-02-25
1 P 2021-03-26
2 N 2021-02-05
2 P 2021-03-04
2 N 2021-03-30
3 N 2021-01-24
3 P 2021-02-10
4 N 2021-02-15
4 P 2021-02-28
4 N 2021-03-18
4 P 2021-04-11

Output:

Total
2

EDIT: There could be multiple N's (at least 1) between two P's, not just 1, and I want to include it in my count.

CodePudding user response:

library(dplyr)

dat %>%
  group_by(ID) %>%
  summarize(
    yourcond = any((Test == "N") & (lag(Test) == "P") & (lead(Test) == "P")))

result:

# A tibble: 4 x 2
     ID yourcond
  <int> <lgl>   
1     1 TRUE    
2     2 NA      
3     3 NA      
4     4 TRUE 

data:

dat <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 
4L, 4L, 4L), Test = c("P", "P", "N", "P", "N", "P", "N", "N", 
"P", "N", "P", "N", "P"), Date = c("2021-01-02", "2021-01-08", 
"2021-02-25", "2021-03-26", "2021-02-05", "2021-03-04", "2021-03-30", 
"2021-01-24", "2021-02-10", "2021-02-15", "2021-02-28", "2021-03-18", 
"2021-04-11")), class = "data.frame", row.names = c(NA, -13L))

Edit

You could add count(yourcond) to the dplyr chain to return counts of each NA and TRUE.

CodePudding user response:

Here is one option with str_c/str_detect - grouped by 'ID', paste the 'Test' elements and then check whether the pattern P followed by one or more N (N ) and then a P occurs

library(stringr)
library(dplyr)
df1 %>% 
  group_by(ID) %>% 
  summarise(isP = str_detect(str_c(Test, collapse = ""), "PN P"), 
    .groups = 'drop') %>% 
 filter(isP)
# A tibble: 2 × 2
     ID isP  
  <int> <lgl>
1     1 TRUE 
2     4 TRUE 

CodePudding user response:

In Pandas:

import pandas as pd

# ...load as DataFrame (df)...

# Do it if not sorted by default.
df = df.sort_values('Date')

ids_patterns = df.groupby('ID')['Test'].agg('sum').str.contains('PN P')

print(ids_patterns)
print(ids_patterns.sum())
ID
1     True
2    False
3    False
4     True
dtype: bool
2
  • Related