Home > Back-end >  strange pandas behavior: masking missing datetime index values creates duplicate rows
strange pandas behavior: masking missing datetime index values creates duplicate rows

Time:05-05

I've encountered a strange problem while trying to mask rows with missing datetime index values using pandas.Index.dropna. Unexpectedly, applying a mask to select only the non-missing datetime values has the consequence of duplicating identically-indexed rows.

Here is a small example which reproduces the problem (pandas 1.4.2):

import pandas as pd
dates = ["01-May-2022", "02-May-2022", "02-May-2022", "03-May-2022", "04-May-2022"]
values = [1, 2, 3, 4, 5]
series = pd.Series(values, index=dates)
seriesB = series.loc[series.index.dropna()]

In [28]: series
Out[28]: 
01-May-2022    1
02-May-2022    2
02-May-2022    3
03-May-2022    4
04-May-2022    5
dtype: int64

In [29]: seriesB
Out[29]: 
01-May-2022    1
02-May-2022    2
02-May-2022    3
02-May-2022    2
02-May-2022    3
03-May-2022    4
04-May-2022    5
dtype: int64

Does anyone know why seriesB has the additional rows?

CodePudding user response:

It is because your Series called series has duplicate values in the index - namely 02-May-2022 so when you use that same value for a lookup in .loc[] it matches twice:

series.loc["02-May-2022"]

02-May-2022    2
02-May-2022    3

If you have the lookup value in .loc[] twice then it will return 4 rows (2x2 rows):

series.loc[["02-May-2022","02-May-2022"]]

02-May-2022    2
02-May-2022    3
02-May-2022    2
02-May-2022    3

The .dropna() has no effect in your example data.

As a side note, you can use this capability of .loc[] to duplicate values if needed:

series.loc[["01-May-2022","01-May-2022","01-May-2022","01-May-2022"]]

01-May-2022    1
01-May-2022    1
01-May-2022    1
01-May-2022    1

Think of this as an index lookup instead of masking.

  • Related