Weighted average datetime, off but only for certain months-CodePudding

I am calculating the weighted average of a series of datetimes (and must be doing it wrong, since I can't explain the following):

import pandas as pd
import numpy as np

foo = pd.DataFrame({'date': ['2022-06-01', '2022-06-16'],
                    'value': [1000, 10000]})
foo['date'] = pd.to_datetime(foo['date'])
bar = np.average(foo['date'].view(dtype='float64'), weights=foo['value'])
print(np.array(bar).view(dtype='datetime64[ns]'))

returns 2022-06-14T15:16:21.818181818, which is expected.

Changing the month to July:

foo = pd.DataFrame({'date': ['2022-07-01', '2022-07-16'],
                    'value': [1000, 10000]})
foo['date'] = pd.to_datetime(foo['date'])
bar = np.average(foo['date'].view(dtype='float64'), weights=foo['value'])
print(np.array(bar).view(dtype='datetime64[ns]'))

returns 2022-07-14T23:59:53.766924660, when the expected result is 2022-07-14T15:16:21.818181818.

The expected result, as calculated in Excel:

What am I overlooking?

EDIT: Additional Detail

My real dataset is much larger and I'd like to use numpy if possible.
foo['date'] can be assumed to be dates with no time component, but the weighted average will have a time component.

CodePudding user response：

I strongly suspect this is a resolution/rounding issue.

I'm assuming that for averaging dates those are converted to timestamps - and then the result is converted back to a datetime object. But pandas works in nanoseconds, so the timestamp values, multiplied by 1000 and 10000 respectively, exceed 2**52 - i.e. exceed the mantissa capability of 64-bit floats.

On the contrary Excel works in milliseconds, so no problems here; Python's datetime.datetime works in microseconds, so still no problems:

dt01 = datetime(2022,7,1)
dt16 = datetime(2022,7,16)
datetime.fromtimestamp((dt01.timestamp()*1000   dt16.timestamp()*10000)/11000)
datetime.datetime(2022, 7, 14, 15, 16, 21, 818182)

So if you need to use numpy/pandas I suppose your best option is to convert dates to timedeltas from a "starting" date (i.e. define a "custom epoch") and compute the weighted average of those values.

CodePudding user response：

first of all I don't think the problem is in your code, I think pandas has a problem.

the problem is that when you use the view command it translates it to a very small number (e-198) so I believe it accidently lose resolution.

I found a solution (I think it works but it gave an 3 hour difference from your answer):

from datetime import datetime
import pandas as pd
import numpy as np

foo = pd.DataFrame({'date': ['2022-07-01', '2022-07-16'],
                    'value': [1000, 10000]})
foo['date'] = pd.to_datetime(foo['date'])
# bar = np.average([x.timestamp() for x in foo['date']], weights=foo['value'])
bar = np.average(foo['date'].apply(datetime.timestamp), weights=foo['value']) # vectorized
print(datetime.fromtimestamp(bar))