I'm working with a Pandas dataframe and have a column of dependent variables (called CLASS), which consists of three classes: Y, N, and P. However, when I run -
df.CLASS.unique()
I get -
array(['N', 'N ', 'P', 'Y', 'Y '], dtype=object)
I opened the dataset in Excel and used the filter to see how many unique values were in the column; Excel says there are only 3.
Terribly confused here, would greatly appreciate some help. The dataset is available here if it's of any benefit.
CodePudding user response:
'N ' (N with a trailing space) and 'N' are two different strings in Pandas, but I think Excel's filter treats them as the same value.
You have to preprocess the data. Use this:
df['CLASS'] = df['CLASS'].replace('N ', 'N')
df['CLASS'] = df['CLASS'].replace('Y ', 'Y')
df.CLASS.unique()
You will get 3 classes after that.
array(['N', 'P', 'Y'], dtype=object)
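If more labels turn out to have stray spaces, replacing them one by one gets tedious. A minimal sketch of a more general cleanup, assuming the column holds only strings: the small dataframe below is a made-up stand-in for the real dataset, and .str.strip() removes leading and trailing whitespace from every value in one step.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: same labels as in the question,
# including the two values with a trailing space.
df = pd.DataFrame({"CLASS": ["N", "N ", "P", "Y", "Y "]})

# repr() puts quotes around each label, so trailing spaces become visible.
print([repr(value) for value in df["CLASS"].unique()])

# Strip leading and trailing whitespace from every label at once.
df["CLASS"] = df["CLASS"].str.strip()

# Only the three intended classes should remain.
print(df["CLASS"].unique())
```

This does the same job as the two replace calls above, but it also catches leading spaces or labels you have not noticed yet.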
P.S. I ran =UNIQUE(N2:N1001) in Excel to find the unique values, and it returned 5 values, so I don't know what's wrong with your Excel.