Integrate a column using integrate.trapz with datetime index-CodePudding

Good morning,

I have the follwing code to determine the integral of a column using the trapez method :

import pandas as pd
from scipy import integrate

df = pd.DataFrame()
df['Date'] = ['29/07/2021', '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '29/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021',   '30/07/2021']
df['Time'] = ['06:48:37',   '06:59:37', '07:14:37', '07:27:44', '07:42:44', '07:57:44', '08:04:32', '08:19:32', '09:01:20', '09:19:06', '09:49:06', '10:09:01', '10:23:31', '10:39:18', '10:54:17', '11:09:17', '11:20:01', '11:35:01', '11:50:00', '11:54:14', '12:09:14', '12:22:01', '12:30:15', '12:45:15', '13:00:15', '13:40:15', '13:55:15', '14:10:15', '14:27:15', '14:42:15', '14:57:15', '15:12:15', '15:27:15', '15:42:15', '15:57:15', '16:12:15', '08:12:50', '08:42:50', '08:57:50', '09:12:50', '09:42:50', '09:57:50', '10:12:50', '10:27:50', '10:42:50', '10:57:50', '11:12:50', '11:27:50', '11:42:50', '11:57:50', '12:12:50', '12:27:50', '12:42:50', '12:57:50', '13:12:50', '13:31:48', '13:43:25', '15:15:20', '15:24:44', '15:34:44', '15:39:03', '15:45:28', '15:55:28', '16:05:28', '16:15:28', '16:25:28', '16:35:28', '16:45:28', '16:55:28', '17:05:28', '17:15:28', '17:25:28', '17:35:28', '17:45:28', '17:55:28', '18:05:28', '18:15:28', '18:25:28']
df['Column1'] = [0.01153489116, 0.01345839865,  0.01779293663,  0.0188075811,   0.02593143441,  0.02516351682,  0.02656128256,  0.02774365902,  0.01068687582,  0.0492178287,   0.03830963094,  0.03982806424,  0.01197452205,  0.0452324925,   0.056356989,    0.057672,   0.06444093731,  0.01257135768,  0.0293379174,   0.01347513612,  0.03167956869,  0.03127426809,  0.0561366325,   0.04949798985,  0.0480188952,   0.0357266179,   0.01970254124,  0.01941959216,  0.01782295605,  0.01299120592,  0.0269445306,   0.01212425752,  0.01330537192,  0.00983425672,  0.0101417148,   0.02101192236,  0.01781862992,  0.00758453253,  0.0076804071,   0.00922775574,  0.0073747856,   0.00853069657,  0.03282369543,  0.02961645624,  0.03013929116,  0.010247364,    0.03243998824,  0.01806667814,  0.0325989132,   0.03179977488,  0.03362982444,  0.0094431753,   0.0082718999,   0.0109086495,   0.04043482872,  0.01571583463,  0.0573673107,   0.03165296424,  0.02008226187,  0.01864084944,  0.02020784928,  0.00982873458,  0.00791156214,  0.0123223301,   0.0067242825,   0.00775056588,  0.004625349911, 0.003382658468, 0.0075472771,   0.006104127873, 0.01520061243,  0.00891038148,  0.0069686624,   0.006432309,    0.00254625114,  0.003212563191, 0.00237200964,  0.001625559964]
df['DateTime'] = pd.to_datetime(df['Date'])   pd.to_timedelta(df['Time'])
dp = df.set_index('DateTime')
dp['Column2'] = dp['Column1'].rolling('1D').apply(integrate.trapz)
print(dp['Column2'].head(1000))

It is working, but the problem is that the trapez method is like this for a range on the x-axis [a, b]:

(b - a) * (f(b) f(a) / 2)

and if we consider many points inside the range [a, b], it will be the 'sum' of (f(b) f(a) / 2) which we multiple afterwards with (b - a).

The code is doing the sum, but he is not multiplying by the time difference in the index which is the datetime.

Would you please give me a hint on why it's not doing that ? Thank you.

PS : take as example the first two points : (0.01153489116 0.01345839865)/2 = 0,012496644905 which is exactly the 2nd value that you will see in the print so it's not multiplying by the datetime difference in index

CodePudding user response：

You need to pass the index as second argument to integrate.trapz in order for it to be able to compute the integral.

dp['Column1'].rolling('1D').apply(lambda x : integrate.trapz(x,x.index))

Which results in

|DateTime            |Column1       |
|--------------------|--------------|
|2021-07-29 06:48:37 |  0.000000e 00|
|2021-07-29 06:59:37 |  8.247786e 09|
|2021-07-29 07:14:37 |  2.231089e 10|
|     ...            |      ...     |
|2021-07-30 18:25:28 |  7.380168e 11|

However I don't see the use of such an integral between to points that are a day apart. If you are interested in a cumulative integral I would suggest using

df['integral'] = ((df.DateTime-df.DateTime.shift(1))*\
                 ((df.Column1 df.Column1.shift(1))/2))[1:].cumsum()

which results in the same dataframe in the beginning but keeps integrating

Or if what you are looking for with rolling('1D') is to compute the Integral for each day separately, then you could use

dp['Column1'].resample('d').apply(lambda x : integrate.trapz(x,x.index))

Which results in

DateTime
2021-07-29   0 days 00:16:19.051562019
2021-07-30   0 days 00:12:18.016804347
Freq: D, Name: Column1, dtype: timedelta64[ns]