Home > Software engineering >  Pandas dataframe filtered with np.where returns wrong data
Pandas dataframe filtered with np.where returns wrong data

Time:09-22

Python 3.8.10, Pandas 1.4.4, Numpy 1.23.2

I'm receiving a data streaming from a machine every 10 seconds (in the database times are precisely stored every 10 seconds: no 9.5 nor 11). The machine goes sometimes in standby so the streaming stops. Occasionally it sends random data when in standby.

To eliminate the spurious data I use a parameter SECONDS: when set to 10, each line is considered valid if there was a stored line 10 seconds before.

import pandas as pd

df = pd.DataFrame(
    {"DT": [   # two occasional readings
        "2022-09-21 00:38:40", "2022-09-21 01:59:30",
        # stable data flow 
        "2022-09-21 04:31:40", "2022-09-21 04:31:50",
        "2022-09-21 04:32:00", "2022-09-21 04:32:10",
        "2022-09-21 04:32:20", "2022-09-21 04:32:30",
        "2022-09-21 04:32:40", "2022-09-21 04:32:50",
        # missed one reading here
        "2022-09-21 04:33:10", "2022-09-21 04:33:20", 
        "2022-09-21 04:33:30", "2022-09-21 04:33:40",
        "2022-09-21 04:33:50", "2022-09-21 04:34:00", 
        "2022-09-21 04:34:10", "2022-09-21 04:34:20",
        "2022-09-21 04:34:30"
    ], 
     "notes": [
        "occasional mistake",
        "occasional mistake",
        "first stable reading",
        "", "", "", "", "", "", "",
        "one reading missing",
        "", "", "", "", "", "", "", ""
     ] },   
)

df.DT = df.DT.astype('datetime64[ns]')

SECONDS = 10  # <====== change here the accepted interval 

df["deltaDT"] = (df["DT"] - df["DT"].shift(SECONDS // 10)).dt.total_seconds()
df["ok"] = np.where(df["deltaDT"]==SECONDS, "ok", "-")


print(df)

This is for me the most intuitive solution, and for 10 seconds it returns the correct answer: enter image description here

(Note: The reading at 2022-09-21 04.33.10 is lost, not an issue here)

Problem arise with only some SECONDS values: 10, 20, 40, 50 are OK. Choosing 30 and 60, the error appears.

This is SECONDS = 30:

enter image description here

It is obviously a problem related to binary conversion. Quite difficult to spot because if I am working with floats that really represent only integer values or NaN, there shouldn't be an infinitesimal epsilon involved.

Changing the filter condition to

df["ok"] = np.where(df["deltaDT"]<=SECONDS EPSILON, "ok", "-")

it return the wrong result for EPSILON=1E-15, and the correct one for EPSILON=1E-14.

My question now (sorry I was longer than the Harry Potter saga):

How to convert df["deltaDT"] = (df["DT"] - df["DT"].shift(SECONDS // 10)).dt.total_seconds() to integer?

Adding .astype("Int64") generates the error "TypeError: cannot safely cast non-equivalent float64 to int64"

CodePudding user response:

to convert your column that contains NaN to integer, it need to be nullable integer:

df["deltaDT"] = df["deltaDT"].astype(pd.Int64Dtype())

CodePudding user response:

The problem is wrongly addressed. It has nothing to do with NaN, but with some approximation when INTEGER seconds are not considered really integer:

df = pd.Series(["2022-09-21 04:32:00", "2022-09-21 04:32:30"]).astype('datetime64[ns]')
deltaT = (df-df.shift()).dt.total_seconds()[1]
print(deltaT)

# 30.000000000000004

That means the time binary representation cannot accomodate all int seconds, but in some case they are approximated.

Rounding the column with np.round() will trunc the epsilon away:

SECONDS = 30  # <====== change here the accepted interval 

df["deltaDT"] = (df["DT"] - df["DT"].shift(SECONDS // 10)).dt.total_seconds()
df["deltaDT"] = np.round(df["deltaDT"], 0) 

df["ok"] = np.where(df["deltaDT"]==SECONDS, "ok", "-")  

returns now the correct

enter image description here

Lesson learned: Don't assume anything when working with "obvious" time intervals.

  • Related