Home > Software design >  How to determine if a test was done how many years consecutively for individuals using pandas?
How to determine if a test was done how many years consecutively for individuals using pandas?

Time:04-11

how do i determine if a certain population has met the conditions of having a test done consecutively for a few years based on their first test date? A example dataset can be found below. I'm thinking of using dataset 1 as df1 and Dataset 2 as df2 but my problem is I'm not sure how to use the first date of collection for different IDs to minus the different collection date of the same IDs?

Dataset 1:

ID Date of Collection Test Done
My-ID 00001 10/05/2016 1A
My-ID 00001 10/01/2017 1A
My-ID 00001 23/01/2018 1A
My-ID 00001 18/04/2019 1A
My-ID 00001 30/04/2020 1A
My-ID 00002 30/09/2015 1A
My-ID 00002 31/05/2016 1A
My-ID 00002 31/05/2017 1A
My-ID 00003 31/05/2017 1A

Dataset 2:

ID Test Done Result Date of Collection
My-ID 00001 1A 50 10/05/2016
My-ID 00002 1A 75 30/09/2015

Desired Outcome:

ID Date of Collection Test Done Year since first collection date
My-ID 00001 10/05/2016 1A 0
My-ID 00001 10/01/2017 1A 1
My-ID 00001 23/01/2018 1A 2
My-ID 00001 18/04/2019 1A 3
My-ID 00001 30/04/2020 1A 4
My-ID 00002 30/09/2015 1A 0
My-ID 00002 31/05/2016 1A 1

CodePudding user response:

If possible use first year per groups substract it grom first values per groups by GroupBy.transform with GroupBy.first:

df1['Date of Collection'] = pd.to_datetime(df1['Date of Collection'], dayfirst=True)

y = df1['Date of Collection'].dt.year
df1['Year since first collection date'] = y.sub(y.groupby(df['ID']).transform('first'))
print (df1)
            ID Date of Collection Test Done  Year since first collection date
0  My-ID 00001         2016-05-10        1A                                 0
1  My-ID 00001         2017-01-10        1A                                 1
2  My-ID 00001         2018-01-23        1A                                 2
3  My-ID 00001         2019-04-18        1A                                 3
4  My-ID 00001         2020-04-30        1A                                 4
5  My-ID 00002         2015-09-30        1A                                 0
6  My-ID 00002         2016-05-31        1A                                 1
7  My-ID 00003         2017-05-31        1A                                 0

If need processing first values from df2 add left join before solution with DataFrame.merge:

df1['Date of Collection'] = pd.to_datetime(df1['Date of Collection'], dayfirst=True)
df2['Date of First Collection'] = pd.to_datetime(df2['Date of First Collection'], dayfirst=True)

y = df1.merge(df2, on=['ID','Test Done'], how='left')['Date of First Collection'].dt.year
df1['Year since first collection date'] = y.sub(y.groupby(df['ID']).transform('first'))
  • Related