Pandas reindex to higher resolution


I have a pandas DataFrame with an index from 3 to 15 in 0.5 steps and want to reindex it to 0.1 steps. I tried this code, but it doesn't work:

import numpy as np
import pandas as pd

# create data and set the index
df = pd.DataFrame({'A': np.arange(3, 5, 0.5), 'B': np.arange(3, 5, 0.5)})
df.set_index('A', inplace=True)
df.reindex(np.arange(3, 5, 0.1)).head(15)

The above code outputs this:

A B
3.0 3.0
3.1 NaN
3.2 NaN
3.3 NaN
3.4 NaN
3.5 NaN * expected output in this position to be 3.5 since it exists in the original df
3.6 NaN
3.7 NaN
3.8 NaN

Strangely the problem is fixed when reindexing from 0 instead of 3 as it's shown in the code below:

df = pd.DataFrame({'A':np.arange(3,5,0.5),'B':np.arange(3,5,0.5)})
df.set_index('A', inplace = True)
print(df.head())
df.reindex(np.arange(0,5,0.1)).head(60)

The output now correctly shows

A B
0.0 NaN
... ...
3.0 3.0
3.1 NaN
3.2 NaN
3.3 NaN
3.4 NaN
3.5 3.5
3.6 NaN
3.7 NaN
3.8 NaN

I'm running python 3.8.5 on Windows 10.

Pandas version is 1.4.07

Numpy version is 1.22.1

Does anyone know why this happens? Is it a known bug or a new one, and has it been fixed in a newer version of Python, pandas, or NumPy?

Thanks

CodePudding user response:

Good question.

The answer is that np.arange(3,5,0.1) creates a value that is not exactly 3.5: it is 3.5000000000000004. np.arange(0,5,0.1), on the other hand, does produce a 3.5 that is exactly 3.5, and np.arange(3,5,0.5) also generates an exact 3.5.

pd.Index(np.arange(3,5,0.1)) 

Float64Index([               3.0,                3.1,                3.2,
              3.3000000000000003, 3.4000000000000004, 3.5000000000000004,
              3.6000000000000005, 3.7000000000000006, 3.8000000000000007,
               3.900000000000001,  4.000000000000001,  4.100000000000001,
               4.200000000000001,  4.300000000000001,  4.400000000000001,
               4.500000000000002,  4.600000000000001,  4.700000000000001,
               4.800000000000002,  4.900000000000002],
             dtype='float64')

and

pd.Index(np.arange(0,5,0.1))

Float64Index([                0.0,                 0.1,                 0.2,
              0.30000000000000004,                 0.4,                 0.5,
               0.6000000000000001,  0.7000000000000001,                 0.8,
                              0.9,                 1.0,                 1.1,
               1.2000000000000002,                 1.3,  1.4000000000000001,
                              1.5,                 1.6,  1.7000000000000002,
                              1.8,  1.9000000000000001,                 2.0,
                              2.1,                 2.2,  2.3000000000000003,
               2.4000000000000004,                 2.5,                 2.6,
                              2.7,  2.8000000000000003,  2.9000000000000004,
                              3.0,                 3.1,                 3.2,
               3.3000000000000003,  3.4000000000000004,                 3.5,
                              3.6,                 3.7,  3.8000000000000003,
               3.9000000000000004,                 4.0,  4.1000000000000005,
                              4.2,                 4.3,                 4.4,
                              4.5,  4.6000000000000005,                 4.7,
                4.800000000000001,                 4.9],
             dtype='float64')

and

pd.Index(np.arange(3,5,0.5))

Float64Index([3.0, 3.5, 4.0, 4.5], dtype='float64')

This is definitely coming from NumPy:

np.arange(3,5,0.1)[5]

3.5000000000000004

and

np.arange(3,5,0.1)[5] == 3.5

False
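As a side note (not strictly needed for the answer), one way to sidestep the mismatch is to round the generated index, which snaps the accumulated error back onto clean one-decimal values:

```python
import numpy as np

# Rounding to one decimal collapses 3.5000000000000004 back to 3.5
idx = np.round(np.arange(3, 5, 0.1), 1)
print(idx[5] == 3.5)  # True
```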

This behaviour is documented in the NumPy arange docs:

https://numpy.org/doc/stable/reference/generated/numpy.arange.html

The length of the output might not be numerically stable.

Another stability issue is due to the internal implementation of numpy.arange. The actual step value used to populate the array is dtype(start + step) - dtype(start) and not step. Precision loss can occur here, due to casting or due to using floating points when start is much larger than step. This can lead to unexpected behaviour.
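To make that quote concrete, here is a small sketch: for start=3 the effective step is (3 + 0.1) - 3, which is slightly larger than 0.1, and that excess accumulates across the array.

```python
import numpy as np

start, step = 3.0, 0.1
# The step actually used is dtype(start + step) - dtype(start), not step
effective_step = (start + step) - start
print(effective_step)              # 0.10000000000000009, not 0.1
# Five effective steps from 3.0 overshoot 3.5, matching arange's output
print(start + 5 * effective_step)  # 3.5000000000000004
print(np.arange(3, 5, 0.1)[5])     # 3.5000000000000004
```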

It looks like np.linspace might be able to help you out here:

pd.Index(np.linspace(3,5,num=21))

Float64Index([3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2,
              4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0],
             dtype='float64')