I am trying to get some metrics on some data at my company.
Basically, I have a dataframe that I have titled rawData. rawData contains a number of columns, mostly parameters I am interested in. I don't think the specifics are too important, so we can just think of these as parameter1, parameter2, and so on.
There is an additional column, which I have titled overallResult. This column will always contain either the string PASS, or FAIL. I am trying to extract a sub-dataframe from my raw data based on the overallResult. It sounds simple enough, but I am messing up my implementation somehow.
I make my mask like this: mask = rawData['overallResult'].eq(truthyVal), where in this case truthyVal is the string 'PASS'.
The mask is created successfully, but the problem comes when I apply it.
I apply the mask like this: filteredData = rawData[mask]. I would like filteredData to contain every column that rawData does, but only the rows where overallResult equals truthyVal.
Instead, it always gives me this error: cannot reindex on an axis with duplicate labels.
From what I understand, the mask contains a boolean list of my overallResult column, true if truthyVal is found on that row, and false if not. I am pretty sure that I am not applying my mask correctly here. There must be some small extra step I am overlooking, and at this point I am frustrated because it seems so simple, so IDK, any ideas?
User response:
You have the principle correct as the following basic example shows:
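import pandas as pd
# Small example frame: one data column and one pass/fail column
df = pd.DataFrame({'data': [1, 2, 3, 4, 5, 6],
                   'test': ['pass', 'fail', 'pass', 'fail', 'pass', 'fail']})
# Boolean Series: True where 'test' equals 'pass'
mask = df['test'].eq('pass')
# Keep only the rows where the mask is True
print(df[mask])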
To decipher your error message it would be necessary to see a data sample which produces it; you might get some useful insights here
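Without seeing your data I can only guess, but since the message mentions duplicate labels, one thing worth checking is whether the index or the column names of rawData contain duplicates. Here is a minimal sketch of that check. The frame raw_data and its contents are made up for illustration (including the deliberately repeated index label), so this is not a reproduction of your error. The last two lines show one possible way to filter by position instead of by label, assuming label alignment is what is going wrong.

```python
import pandas as pd

# Hypothetical stand-in for rawData; the index label 1 is repeated on purpose.
raw_data = pd.DataFrame(
    {
        "parameter1": [1, 2, 3, 4],
        "overallResult": ["PASS", "FAIL", "PASS", "FAIL"],
    },
    index=[0, 1, 1, 2],
)

# Check both axes for duplicate labels.
index_unique = raw_data.index.is_unique
columns_unique = raw_data.columns.is_unique
duplicated_columns = raw_data.columns[raw_data.columns.duplicated()].tolist()
print("index unique:", index_unique)
print("columns unique:", columns_unique)
print("duplicated column names:", duplicated_columns)

# Converting the mask to a plain NumPy array makes the selection positional,
# so pandas does not need to align the mask with the frame's labels.
mask = raw_data["overallResult"].eq("PASS").to_numpy()
filtered = raw_data.loc[mask]
print(filtered)
```

If either check reports duplicates, rawData.reset_index(drop=True) or renaming the repeated columns before building the mask may be enough, but that depends on how the frame was built.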