How to determine if a test was done how many years consecutively for individuals using pandas?-CodePudding

how do i determine if a certain population has met the conditions of having a test done consecutively for a few years based on their first test date? A example dataset can be found below. I'm thinking of using dataset 1 as df1 and Dataset 2 as df2 but my problem is I'm not sure how to use the first date of collection for different IDs to minus the different collection date of the same IDs?

Dataset 1:

ID	Date of Collection	Test Done
My-ID 00001	10/05/2016	1A
My-ID 00001	10/01/2017	1A
My-ID 00001	23/01/2018	1A
My-ID 00001	18/04/2019	1A
My-ID 00001	30/04/2020	1A
My-ID 00002	30/09/2015	1A
My-ID 00002	31/05/2016	1A
My-ID 00002	31/05/2017	1A
My-ID 00003	31/05/2017	1A

Dataset 2:

ID	Test Done	Result	Date of Collection
My-ID 00001	1A	50	10/05/2016
My-ID 00002	1A	75	30/09/2015

Desired Outcome:

ID	Date of Collection	Test Done	Year since first collection date
My-ID 00001	10/05/2016	1A	0
My-ID 00001	10/01/2017	1A	1
My-ID 00001	23/01/2018	1A	2
My-ID 00001	18/04/2019	1A	3
My-ID 00001	30/04/2020	1A	4
My-ID 00002	30/09/2015	1A	0
My-ID 00002	31/05/2016	1A	1

CodePudding user response：

If possible use first year per groups substract it grom first values per groups by GroupBy.transform with GroupBy.first:

df1['Date of Collection'] = pd.to_datetime(df1['Date of Collection'], dayfirst=True)

y = df1['Date of Collection'].dt.year
df1['Year since first collection date'] = y.sub(y.groupby(df['ID']).transform('first'))
print (df1)
            ID Date of Collection Test Done  Year since first collection date
0  My-ID 00001         2016-05-10        1A                                 0
1  My-ID 00001         2017-01-10        1A                                 1
2  My-ID 00001         2018-01-23        1A                                 2
3  My-ID 00001         2019-04-18        1A                                 3
4  My-ID 00001         2020-04-30        1A                                 4
5  My-ID 00002         2015-09-30        1A                                 0
6  My-ID 00002         2016-05-31        1A                                 1
7  My-ID 00003         2017-05-31        1A                                 0

If need processing first values from df2 add left join before solution with DataFrame.merge:

df1['Date of Collection'] = pd.to_datetime(df1['Date of Collection'], dayfirst=True)
df2['Date of First Collection'] = pd.to_datetime(df2['Date of First Collection'], dayfirst=True)

y = df1.merge(df2, on=['ID','Test Done'], how='left')['Date of First Collection'].dt.year
df1['Year since first collection date'] = y.sub(y.groupby(df['ID']).transform('first'))