I am using the following code to return the number of rows where the NID and Fdat in df1 and df2 are the same and the Tdate in df2 is less than the RxDate in df1.
def breedings(row):
row['TLSCC'] = len(df2[((df2['NID']==row['NID']) & (df2['Fdat']==row['Fdat']) & (df2['Tdate'] < row['RxDate']) )] )
return row
df1 = df1.apply(lambda x:breedings(x),axis=1)
The result is place in a new column TLSCC in df1
I would like to search the records in df2 to find the maximum SCC value using the same conditions that were used to determine the number of rows that met the conditions. ie I would like to return the max.scc value from these rows to a new column in df1.
Replacing len with max does not work as it does not link to SCC. I have also tried putting max into the apply lambda statement. Without success.
Any pointers appreciated.
The data for df1 and df2 is provided below.
df1
{'NID': {1: '212017463',
2: '212017463',
3: '212017463',
4: '213015177',
5: '213015177',
6: '213015177',
7: '213015177',
8: '213015177',
9: '213015177',
10: '213015177'},
'Fdat': {1: Timestamp('2020-08-06 00:00:00'),
2: Timestamp('2020-08-06 00:00:00'),
3: Timestamp('2020-08-06 00:00:00'),
4: Timestamp('2018-08-30 00:00:00'),
5: Timestamp('2018-08-30 00:00:00'),
6: Timestamp('2018-08-30 00:00:00'),
7: Timestamp('2018-08-30 00:00:00'),
8: Timestamp('2018-08-30 00:00:00'),
9: Timestamp('2018-08-30 00:00:00'),
10: Timestamp('2018-08-30 00:00:00')},
'RxDate': {1: Timestamp('2020-09-30 00:00:00'),
2: Timestamp('2020-10-12 00:00:00'),
3: Timestamp('2020-10-15 00:00:00'),
4: Timestamp('2018-12-06 00:00:00'),
5: Timestamp('2018-12-07 00:00:00'),
6: Timestamp('2018-12-09 00:00:00'),
7: Timestamp('2018-12-11 00:00:00'),
8: Timestamp('2018-12-17 00:00:00'),
9: Timestamp('2018-12-18 00:00:00'),
10: Timestamp('2018-12-19 00:00:00')}}
df2
{'NID': {36: '212017463',
37: '212017463',
38: '212017463',
39: '212017463',
40: '212017463',
41: '212017463',
42: '212017463',
43: '212017463',
44: '212017463',
45: '212017463',
46: '212017463',
47: '212017463',
48: '212017463',
49: '212017463',
50: '212017463',
51: '212017463',
52: '212017463',
53: '212017463',
54: '212017463',
55: '212017463',
56: '212017463',
57: '212017463',
58: '212017463',
59: '212017463',
60: '212017463',
61: '212017463',
62: '212017463',
63: '212017463',
64: '212017463',
65: '212017463',
66: '212017463',
67: '212017463',
68: '212017463',
69: '212017463',
70: '213015177',
71: '213015177',
72: '213015177',
73: '213015177',
74: '213015177',
75: '213015177',
76: '213015177',
77: '213015177',
78: '213015177',
79: '213015177',
80: '213015177',
81: '213015177',
82: '213015177',
83: '213015177',
84: '213015177',
85: '213015177',
86: '213015177',
87: '213015177',
88: '213015177',
89: '213015177',
90: '213015177',
91: '213015177',
92: '213015177',
93: '213015177',
94: '213015177',
95: '213015177',
96: '213015177',
97: '213015177',
98: '213015177',
99: '213015177',
100: '213015177',
101: '213015177'},
'Fdat': {36: Timestamp('2014-08-17 00:00:00'),
37: Timestamp('2014-08-17 00:00:00'),
38: Timestamp('2014-08-17 00:00:00'),
39: Timestamp('2014-08-17 00:00:00'),
40: Timestamp('2014-08-17 00:00:00'),
41: Timestamp('2015-09-22 00:00:00'),
42: Timestamp('2015-09-22 00:00:00'),
43: Timestamp('2015-09-22 00:00:00'),
44: Timestamp('2015-09-22 00:00:00'),
45: Timestamp('2015-09-22 00:00:00'),
46: Timestamp('2015-09-22 00:00:00'),
47: Timestamp('2016-09-01 00:00:00'),
48: Timestamp('2016-09-01 00:00:00'),
49: Timestamp('2016-09-01 00:00:00'),
50: Timestamp('2016-09-01 00:00:00'),
51: Timestamp('2016-09-01 00:00:00'),
52: Timestamp('2017-08-31 00:00:00'),
53: Timestamp('2017-08-31 00:00:00'),
54: Timestamp('2017-08-31 00:00:00'),
55: Timestamp('2017-08-31 00:00:00'),
56: Timestamp('2017-08-31 00:00:00'),
57: Timestamp('2017-08-31 00:00:00'),
58: Timestamp('2018-09-17 00:00:00'),
59: Timestamp('2018-09-17 00:00:00'),
60: Timestamp('2018-09-17 00:00:00'),
61: Timestamp('2018-09-17 00:00:00'),
62: Timestamp('2018-09-17 00:00:00'),
63: Timestamp('2019-08-25 00:00:00'),
64: Timestamp('2019-08-25 00:00:00'),
65: Timestamp('2019-08-25 00:00:00'),
66: Timestamp('2019-08-25 00:00:00'),
67: Timestamp('2019-08-25 00:00:00'),
68: Timestamp('2020-08-06 00:00:00'),
69: Timestamp('2020-08-06 00:00:00'),
70: Timestamp('2015-09-01 00:00:00'),
71: Timestamp('2015-09-01 00:00:00'),
72: Timestamp('2015-09-01 00:00:00'),
73: Timestamp('2015-09-01 00:00:00'),
74: Timestamp('2015-09-01 00:00:00'),
75: Timestamp('2015-09-01 00:00:00'),
76: Timestamp('2016-08-18 00:00:00'),
77: Timestamp('2016-08-18 00:00:00'),
78: Timestamp('2016-08-18 00:00:00'),
79: Timestamp('2016-08-18 00:00:00'),
80: Timestamp('2016-08-18 00:00:00'),
81: Timestamp('2017-09-01 00:00:00'),
82: Timestamp('2017-09-01 00:00:00'),
83: Timestamp('2017-09-01 00:00:00'),
84: Timestamp('2017-09-01 00:00:00'),
85: Timestamp('2017-09-01 00:00:00'),
86: Timestamp('2017-09-01 00:00:00'),
87: Timestamp('2018-08-30 00:00:00'),
88: Timestamp('2018-08-30 00:00:00'),
89: Timestamp('2018-08-30 00:00:00'),
90: Timestamp('2018-08-30 00:00:00'),
91: Timestamp('2018-08-30 00:00:00'),
92: Timestamp('2019-08-09 00:00:00'),
93: Timestamp('2019-08-09 00:00:00'),
94: Timestamp('2019-08-09 00:00:00'),
95: Timestamp('2019-08-09 00:00:00'),
96: Timestamp('2019-08-09 00:00:00'),
97: Timestamp('2020-08-06 00:00:00'),
98: Timestamp('2020-08-06 00:00:00'),
99: Timestamp('2020-08-06 00:00:00'),
100: Timestamp('2020-08-06 00:00:00'),
101: Timestamp('2020-08-06 00:00:00')},
'Tdate': {36: Timestamp('2015-04-16 00:00:00'),
37: Timestamp('2015-02-13 00:00:00'),
38: Timestamp('2014-10-10 00:00:00'),
39: Timestamp('2014-12-17 00:00:00'),
40: Timestamp('2014-11-07 00:00:00'),
41: Timestamp('2015-09-29 00:00:00'),
42: Timestamp('2016-05-30 00:00:00'),
43: Timestamp('2016-02-10 00:00:00'),
44: Timestamp('2016-04-12 00:00:00'),
45: Timestamp('2015-11-06 00:00:00'),
46: Timestamp('2015-12-18 00:00:00'),
47: Timestamp('2017-05-31 00:00:00'),
48: Timestamp('2016-11-28 00:00:00'),
49: Timestamp('2017-04-04 00:00:00'),
50: Timestamp('2017-02-15 00:00:00'),
51: Timestamp('2016-10-03 00:00:00'),
52: Timestamp('2018-05-29 00:00:00'),
53: Timestamp('2018-03-20 00:00:00'),
54: Timestamp('2018-04-24 00:00:00'),
55: Timestamp('2018-02-05 00:00:00'),
56: Timestamp('2017-10-03 00:00:00'),
57: Timestamp('2017-11-28 00:00:00'),
58: Timestamp('2019-06-03 00:00:00'),
59: Timestamp('2019-04-10 00:00:00'),
60: Timestamp('2019-01-14 00:00:00'),
61: Timestamp('2018-10-02 00:00:00'),
62: Timestamp('2018-11-14 00:00:00'),
63: Timestamp('2020-03-24 00:00:00'),
64: Timestamp('2020-05-21 00:00:00'),
65: Timestamp('2019-11-14 00:00:00'),
66: Timestamp('2020-01-21 00:00:00'),
67: Timestamp('2019-09-27 00:00:00'),
68: Timestamp('2020-09-30 00:00:00'),
69: Timestamp('2020-12-16 00:00:00'),
70: Timestamp('2015-09-29 00:00:00'),
71: Timestamp('2016-04-12 00:00:00'),
72: Timestamp('2016-05-30 00:00:00'),
73: Timestamp('2016-02-10 00:00:00'),
74: Timestamp('2015-11-06 00:00:00'),
75: Timestamp('2015-12-18 00:00:00'),
76: Timestamp('2017-04-04 00:00:00'),
77: Timestamp('2017-05-31 00:00:00'),
78: Timestamp('2017-02-15 00:00:00'),
79: Timestamp('2016-11-28 00:00:00'),
80: Timestamp('2016-10-03 00:00:00'),
81: Timestamp('2018-05-29 00:00:00'),
82: Timestamp('2018-03-20 00:00:00'),
83: Timestamp('2018-04-24 00:00:00'),
84: Timestamp('2017-10-03 00:00:00'),
85: Timestamp('2018-02-05 00:00:00'),
86: Timestamp('2017-11-28 00:00:00'),
87: Timestamp('2019-04-10 00:00:00'),
88: Timestamp('2019-06-03 00:00:00'),
89: Timestamp('2019-01-14 00:00:00'),
90: Timestamp('2018-11-14 00:00:00'),
91: Timestamp('2018-10-02 00:00:00'),
92: Timestamp('2020-05-21 00:00:00'),
93: Timestamp('2020-03-24 00:00:00'),
94: Timestamp('2019-11-14 00:00:00'),
95: Timestamp('2020-01-21 00:00:00'),
96: Timestamp('2019-09-27 00:00:00'),
97: Timestamp('2021-05-13 00:00:00'),
98: Timestamp('2021-01-27 00:00:00'),
99: Timestamp('2021-03-18 00:00:00'),
100: Timestamp('2020-09-30 00:00:00'),
101: Timestamp('2020-12-16 00:00:00')},
'SCC': {36: 26,
37: 853,
38: 66,
39: 39,
40: 18,
41: 1704,
42: 16,
43: 13,
44: 23,
45: 51,
46: 15,
47: 95,
48: 21,
49: 51,
50: 57,
51: 65,
52: 55,
53: 77,
54: 53,
55: 65,
56: 17,
57: 51,
58: 113,
59: 57,
60: 58,
61: 520,
62: 735,
63: 50,
64: 98,
65: 12,
66: 109,
67: 19,
68: 6939,
69: 1891,
70: 166,
71: 15,
72: 25,
73: 14,
74: 34,
75: 24,
76: 45,
77: 74,
78: 17,
79: 3,
80: 7,
81: 32,
82: 27,
83: 122,
84: 4,
85: 62,
86: 7,
87: 35,
88: 74,
89: 117,
90: 110,
91: 2,
92: 55,
93: 316,
94: 17,
95: 19,
96: 8,
97: 77,
98: 12,
99: 50,
100: 16,
101: 19}}
CodePudding user response:
I would first join and then apply a condition on the dates to generate the joined table that matches your conditions:
>>> join = df1.merge(df2, on=['NID', 'Fdat'])
>>> join = join[join['Tdate'] < join['RxDate']]
>>> join
NID Fdat RxDate Tdate SCC
2 212017463 2020-08-06 2020-10-12 2020-09-30 6939
4 212017463 2020-08-06 2020-10-15 2020-09-30 6939
9 213015177 2018-08-30 2018-12-06 2018-11-14 110
10 213015177 2018-08-30 2018-12-06 2018-10-02 2
14 213015177 2018-08-30 2018-12-07 2018-11-14 110
15 213015177 2018-08-30 2018-12-07 2018-10-02 2
19 213015177 2018-08-30 2018-12-09 2018-11-14 110
20 213015177 2018-08-30 2018-12-09 2018-10-02 2
24 213015177 2018-08-30 2018-12-11 2018-11-14 110
25 213015177 2018-08-30 2018-12-11 2018-10-02 2
29 213015177 2018-08-30 2018-12-17 2018-11-14 110
30 213015177 2018-08-30 2018-12-17 2018-10-02 2
34 213015177 2018-08-30 2018-12-18 2018-11-14 110
35 213015177 2018-08-30 2018-12-18 2018-10-02 2
39 213015177 2018-08-30 2018-12-19 2018-11-14 110
40 213015177 2018-08-30 2018-12-19 2018-10-02 2
Then a groupby on that table gives you all you want, the number of records with .size()
and the max with ['SCC'].max()
:
>>> join.groupby(df1.columns.to_list()).size().rename('TLSCC')
NID Fdat RxDate
212017463 2020-08-06 2020-10-12 1
2020-10-15 1
213015177 2018-08-30 2018-12-06 2
2018-12-07 2
2018-12-09 2
2018-12-11 2
2018-12-17 2
2018-12-18 2
2018-12-19 2
Name: TLSCC, dtype: int64
>>> join.groupby(df1.columns.to_list())['SCC'].max()
NID Fdat RxDate
212017463 2020-08-06 2020-10-12 6939
2020-10-15 6939
213015177 2018-08-30 2018-12-06 110
2018-12-07 110
2018-12-09 110
2018-12-11 110
2018-12-17 110
2018-12-18 110
2018-12-19 110
You can then put that all back in df1 with .merge
:
>>> group = join.groupby(df1.columns.to_list())
>>> df1 = df1.merge(group.size().rename('TLSCC').reset_index())\
... .merge(group['SCC'].max().rename('max.scc').reset_index())
...
>>> df1
NID Fdat RxDate TLSCC max.scc
0 212017463 2020-08-06 2020-10-12 1 6939
1 212017463 2020-08-06 2020-10-15 1 6939
2 213015177 2018-08-30 2018-12-06 2 110
3 213015177 2018-08-30 2018-12-07 2 110
4 213015177 2018-08-30 2018-12-09 2 110
5 213015177 2018-08-30 2018-12-11 2 110
6 213015177 2018-08-30 2018-12-17 2 110
7 213015177 2018-08-30 2018-12-18 2 110
8 213015177 2018-08-30 2018-12-19 2 110