Multiple instances of unique variables in dataframe column

I'm working with a Pandas dataframe and have a column of dependent variables (called CLASS), which consists of three classes: Y, N, and P. However, when I run -

df.CLASS.unique()

I get -

array(['N', 'N ', 'P', 'Y', 'Y '], dtype=object)

I opened the dataset in Excel and used the filter to see how many unique values were in the column; Excel says there are only 3.
Terribly confused here, would greatly appreciate some help. The dataset is available here if it's of any benefit.

CodePudding user response:

'N ' (N followed by a trailing space) and 'N' are different strings in Pandas, but I think Excel's filter treats them as the same.
You have to preprocess the data so the padded values match the clean ones:

df['CLASS'] = df['CLASS'].replace('N ', 'N')
df['CLASS'] = df['CLASS'].replace('Y ', 'Y')

After that, df.CLASS.unique() will return 3 classes:

array(['N', 'P', 'Y'], dtype=object)
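
If more values turn out to have stray spaces, a more general option is to strip whitespace from the whole column instead of replacing each padded variant one by one. This is only a sketch: the sample dataframe below is made up to mimic the labels from the question, and only the column name CLASS comes from the original post.

```python
import pandas as pd

# Hypothetical sample data mimicking the padded labels from the question.
df = pd.DataFrame({"CLASS": ["N", "N ", "P", "Y", "Y "]})

# Remove leading and trailing whitespace from every label in the column.
df["CLASS"] = df["CLASS"].str.strip()

# Collect the distinct labels that remain after cleaning.
classes = sorted(df["CLASS"].unique())
print(classes)
```

str.strip() only touches leading and trailing whitespace, so it would not merge labels that differ in other ways (for example 'n' vs 'N').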

P.S. I ran =UNIQUE(N2:N1001) to find the unique values in Excel, and it returned 5 values, so I don't know what's wrong with your Excel.