Pandas dataframe masking error: cannot reindex on an axis with duplicate labels

Time:01-26

I am trying to get some metrics on some data at my company.

Basically, I have a dataframe that I have titled rawData. rawData contains a number of columns, mostly parameters I am interested in. I don't think the specifics are too important, so we can just think of these as parameter1, parameter2, and so on.

There is an additional column, which I have titled overallResult. This column will always contain either the string PASS or the string FAIL. I am trying to extract a sub-dataframe from my raw data based on overallResult. It sounds simple enough, but I am messing up my implementation somehow.

I make my mask like this: mask = rawData[overallResult].eq(truthyVal), where in this case truthyVal is the string PASS.

The mask is created successfully, but the problem comes when I apply it.

I apply the mask like this: filteredData = rawData[mask], and I would like filteredData to contain everything that rawData does, but only the rows where overallResult equals truthyVal.

Instead, it always gives me this error: cannot reindex on an axis with duplicate labels.

From what I understand, the mask is a boolean series built from my overallResult column: True if truthyVal is found on that row, and False if not. I am pretty sure that I am not applying my mask correctly here. There must be some small extra step I am overlooking, and at this point I am frustrated because it seems so simple. Any ideas?

CodePudding user response:

You have the principle correct as the following basic example shows:

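import pandas as pd

# Small frame with a default, unique RangeIndex
df = pd.DataFrame({'data': [1, 2, 3, 4, 5, 6],
                   'test': ['pass', 'fail', 'pass', 'fail', 'pass', 'fail']})

# Boolean Series: True where the 'test' column equals 'pass'
mask = df['test'].eq('pass')

# Keep only the rows where the mask is True
print(df[mask])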

To decipher your error message it would be necessary to see a data sample which produces it; you might get some useful insights here
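
In the meantime, since the message complains about duplicate labels, one thing you could try is checking whether the index or the column names of rawData contain repeats. The sketch below is only an illustration under assumptions: the frame raw and its values are made up to stand in for rawData, and it does not reproduce your error. It only shows how to look for duplicate labels and how to apply the mask by position, so that no label alignment is involved.

```python
import pandas as pd

# Hypothetical stand-in for rawData: the index label 1 appears twice,
# as can happen after concatenating frames without ignore_index=True.
raw = pd.DataFrame(
    {
        "parameter1": [1, 2, 3, 4],
        "overallResult": ["PASS", "FAIL", "PASS", "FAIL"],
    },
    index=[0, 1, 1, 2],
)

# Step 1: look for duplicate labels on either axis.
print("index unique:  ", raw.index.is_unique)
print("columns unique:", raw.columns.is_unique)
duplicate_index_labels = raw.index[raw.index.duplicated()].tolist()
duplicate_column_names = raw.columns[raw.columns.duplicated()].tolist()
print("duplicate index labels:", duplicate_index_labels)
print("duplicate column names:", duplicate_column_names)

# Step 2: build the mask, then apply it as a plain NumPy boolean array
# so that rows are selected by position rather than by label.
mask = raw["overallResult"].eq("PASS")
filtered = raw[mask.to_numpy()]
print(filtered)
```

If the index turns out to have duplicates and you do not need the labels, raw.reset_index(drop=True) before masking is another option. If a column name is duplicated, note that selecting it returns a DataFrame rather than a Series, so the mask would no longer be one-dimensional.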
