how do i determine if a certain population has met the conditions of having a test done consecutively for a few years based on their first test date? A example dataset can be found below. I'm thinking of using dataset 1 as df1 and Dataset 2 as df2 but my problem is I'm not sure how to use the first date of collection for different IDs to minus the different collection date of the same IDs?
Dataset 1:
ID | Date of Collection | Test Done |
---|---|---|
My-ID 00001 | 10/05/2016 | 1A |
My-ID 00001 | 10/01/2017 | 1A |
My-ID 00001 | 23/01/2018 | 1A |
My-ID 00001 | 18/04/2019 | 1A |
My-ID 00001 | 30/04/2020 | 1A |
My-ID 00002 | 30/09/2015 | 1A |
My-ID 00002 | 31/05/2016 | 1A |
My-ID 00002 | 31/05/2017 | 1A |
My-ID 00003 | 31/05/2017 | 1A |
Dataset 2:
ID | Test Done | Result | Date of Collection |
---|---|---|---|
My-ID 00001 | 1A | 50 | 10/05/2016 |
My-ID 00002 | 1A | 75 | 30/09/2015 |
Desired Outcome:
ID | Date of Collection | Test Done | Year since first collection date |
---|---|---|---|
My-ID 00001 | 10/05/2016 | 1A | 0 |
My-ID 00001 | 10/01/2017 | 1A | 1 |
My-ID 00001 | 23/01/2018 | 1A | 2 |
My-ID 00001 | 18/04/2019 | 1A | 3 |
My-ID 00001 | 30/04/2020 | 1A | 4 |
My-ID 00002 | 30/09/2015 | 1A | 0 |
My-ID 00002 | 31/05/2016 | 1A | 1 |
CodePudding user response:
If possible use first year per groups substract it grom first values per groups by GroupBy.transform
with GroupBy.first
:
df1['Date of Collection'] = pd.to_datetime(df1['Date of Collection'], dayfirst=True)
y = df1['Date of Collection'].dt.year
df1['Year since first collection date'] = y.sub(y.groupby(df['ID']).transform('first'))
print (df1)
ID Date of Collection Test Done Year since first collection date
0 My-ID 00001 2016-05-10 1A 0
1 My-ID 00001 2017-01-10 1A 1
2 My-ID 00001 2018-01-23 1A 2
3 My-ID 00001 2019-04-18 1A 3
4 My-ID 00001 2020-04-30 1A 4
5 My-ID 00002 2015-09-30 1A 0
6 My-ID 00002 2016-05-31 1A 1
7 My-ID 00003 2017-05-31 1A 0
If need processing first values from df2
add left join before solution with DataFrame.merge
:
df1['Date of Collection'] = pd.to_datetime(df1['Date of Collection'], dayfirst=True)
df2['Date of First Collection'] = pd.to_datetime(df2['Date of First Collection'], dayfirst=True)
y = df1.merge(df2, on=['ID','Test Done'], how='left')['Date of First Collection'].dt.year
df1['Year since first collection date'] = y.sub(y.groupby(df['ID']).transform('first'))