I have a dataset PosNeg like this. I need to find count of ID's who have a pattern like this- P N P P or N P N N P N - that is having at least one N (negative) between two P's(positive). If this pattern occurs at least once, then count that ID. Date is always in ascending order.
Eg: for ID 1, I have at least 1 N on 02/25 between two P's, Hence I'll count ID 1. ID 2 and 3 does not have N between 2 P's, so those are not counted. ID 4 also has one N between two P's on 03/18, hence I'll include 4. So the total number of ID's which satisfices the condition is 2(1 and 4)
My idea is to find min(date) of positive and max(date) of positive for each ID, and look for any negative in between these dates, but not sure how to implement it. Any suggestions in R/Python/SQL would be helpful.
ID | Test | Date |
---|---|---|
1 | P | 2021-01-02 |
1 | P | 2021-01-08 |
1 | N | 2021-02-25 |
1 | P | 2021-03-26 |
2 | N | 2021-02-05 |
2 | P | 2021-03-04 |
2 | N | 2021-03-30 |
3 | N | 2021-01-24 |
3 | P | 2021-02-10 |
4 | N | 2021-02-15 |
4 | P | 2021-02-28 |
4 | N | 2021-03-18 |
4 | P | 2021-04-11 |
Output:
Total |
---|
2 |
EDIT: There could be multiple N's (at least 1) between two P's, not just 1, and I want to include it in my count.
CodePudding user response:
library(dplyr)
dat %>%
group_by(ID) %>%
summarize(
yourcond = any((Test == "N") & (lag(Test) == "P") & (lead(Test) == "P")))
result:
# A tibble: 4 x 2
ID yourcond
<int> <lgl>
1 1 TRUE
2 2 NA
3 3 NA
4 4 TRUE
data:
dat <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L,
4L, 4L, 4L), Test = c("P", "P", "N", "P", "N", "P", "N", "N",
"P", "N", "P", "N", "P"), Date = c("2021-01-02", "2021-01-08",
"2021-02-25", "2021-03-26", "2021-02-05", "2021-03-04", "2021-03-30",
"2021-01-24", "2021-02-10", "2021-02-15", "2021-02-28", "2021-03-18",
"2021-04-11")), class = "data.frame", row.names = c(NA, -13L))
Edit
You could add count(yourcond)
to the dplyr chain to return counts of each NA and TRUE.
CodePudding user response:
Here is one option with str_c/str_detect
- grouped by 'ID', paste
the 'Test' elements and then check whether the pattern P
followed by one or more N
(N
) and then a P
occurs
library(stringr)
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(isP = str_detect(str_c(Test, collapse = ""), "PN P"),
.groups = 'drop') %>%
filter(isP)
# A tibble: 2 × 2
ID isP
<int> <lgl>
1 1 TRUE
2 4 TRUE
CodePudding user response:
In Pandas:
import pandas as pd
# ...load as DataFrame (df)...
# Do it if not sorted by default.
df = df.sort_values('Date')
ids_patterns = df.groupby('ID')['Test'].agg('sum').str.contains('PN P')
print(ids_patterns)
print(ids_patterns.sum())
ID
1 True
2 False
3 False
4 True
dtype: bool
2