I've encountered a strange problem while trying to mask rows with missing datetime index values using pandas.Index.dropna. Unexpectedly, applying a mask to select only the non-missing datetime values has the consequence of duplicating identically-indexed rows.
Here is a small example which reproduces the problem (pandas 1.4.2):
import pandas as pd
dates = ["01-May-2022", "02-May-2022", "02-May-2022", "03-May-2022", "04-May-2022"]
values = [1, 2, 3, 4, 5]
series = pd.Series(values, index=dates)
seriesB = series.loc[series.index.dropna()]
In [28]: series
Out[28]:
01-May-2022 1
02-May-2022 2
02-May-2022 3
03-May-2022 4
04-May-2022 5
dtype: int64
In [29]: seriesB
Out[29]:
01-May-2022 1
02-May-2022 2
02-May-2022 3
02-May-2022 2
02-May-2022 3
03-May-2022 4
04-May-2022 5
dtype: int64
Does anyone know why seriesB has the additional rows?
CodePudding user response:
It is because your Series
called series has duplicate values in the index - namely 02-May-2022
so when you use that same value for a lookup in .loc[]
it matches twice:
series.loc["02-May-2022"]
02-May-2022 2
02-May-2022 3
If you have the lookup value in .loc[]
twice then it will return 4 rows (2x2 rows):
series.loc[["02-May-2022","02-May-2022"]]
02-May-2022 2
02-May-2022 3
02-May-2022 2
02-May-2022 3
The .dropna()
has no effect in your example data.
As a side note, you can use this capability of .loc[]
to duplicate values if needed:
series.loc[["01-May-2022","01-May-2022","01-May-2022","01-May-2022"]]
01-May-2022 1
01-May-2022 1
01-May-2022 1
01-May-2022 1
Think of this as an index lookup instead of masking.