Dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
EDIT: Thanks to @maow, the syntax errors are now fixed in the updated code. However, I still need some help with the algorithm for replacing the highest 10 values.
I am currently doing a project which requires me to analyse wine data. I have spotted some extreme outliers in each column of the csv file. In short, I have determined that the highest 10 values of each column must be replaced by the median value of that column. I tried the following with help from one article (Pandas Replace certain values in each column) and modified it as shown below, but unfortunately this is my first time with Python and I have no idea what is causing the error.
import pandas as pd
import numpy as np

df = pd.read_csv('C:/Users/hello/Downloads/winequality-red-ori.csv')

def cut(column):
    condition = column > np.percentile(column, 99.26470588)  # top 10 rows out of 1360 rows
    replacewith = np.median(column)  # replace with median
    np.select(condition.values.reshape(-1, 1), column.values, replacewith)  # input changes

df.set_index(["citric acid", "quality"], inplace=True)  # exclude citric acid and quality
df = df.apply(lambda x: cut(x)).reset_index()
df.to_csv('C:/Users/hello/Downloads/new.csv')
I have tried researching what causes the error, including missing values in the csv file, but I have none. I am also not sure whether the above code will help me achieve my goal even without this error. Any help appreciated.
CodePudding user response:
The error appears because you are using np.select incorrectly. It expects an array of conditions, an array of choices, and a default value, in this order.
It works with
np.select(condition.values.reshape(-1, 1), column.values, replacewith)
- You are using a NumPy function on pandas objects. This may work, but accessing the underlying NumPy array (via .values) is IMHO good practice.
- Also, np.select is not doing what you think it does. For each position it picks the choice paired with the first condition that is True. With the reshaped condition, every row counts as its own condition, so you basically select a single element: the first value that belongs to the 10 largest.
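To make the point about np.select concrete, here is a minimal sketch on a made-up four-element array (the array and the variable names are illustrative, not taken from the wine data). It shows the documented elementwise behaviour, the single-element result that the reshaped call produces, and np.where as a simpler tool when there is only one condition.

```python
import numpy as np

values = np.array([1, 5, 9, 12])  # toy data, not the wine dataset

# np.select(condlist, choicelist, default): for every position, the choice
# paired with the first True condition wins; default fills the rest.
labelled = np.select(
    [values > 10, values > 4],    # conditions, checked in this order
    [values * 100, values * 10],  # one choice array per condition
    default=0,
)
print(labelled)  # 0, 50, 90, 1200

# Reshaping to (-1, 1) turns every row into its own condition and every
# value into its own scalar choice, so only one element comes back.
flag = values > 10
single = np.select(flag.reshape(-1, 1), values, default=-1)
print(single)  # one element: 12

# With a single condition, np.where replaces elementwise.
capped = np.where(flag, np.median(values), values)
print(capped)  # 1.0, 5.0, 9.0, 7.0
```

The last call is probably closer to what the question is after: values above the threshold become the median, everything else is kept.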
Final note: by calling set_index twice, the second call replaces the index set by the first, so citric acid is not kept in the index. Call it once with both columns:
df.set_index(["citric acid", "quality"], inplace=True) # exclude citric acid and quality
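Putting these notes together, one possible way to reach the original goal is sketched below. It is not the code from the question: it swaps the percentile threshold and np.select for a positional np.argsort lookup, and the helper name replace_top_n_with_median as well as the small DataFrame are hypothetical stand-ins for the real file. It also assumes the columns are numeric and contain no missing values, as stated in the question.

```python
import numpy as np
import pandas as pd

def replace_top_n_with_median(column: pd.Series, n: int = 10) -> pd.Series:
    """Return a copy of `column` with its n largest values set to the median."""
    values = column.to_numpy(dtype=float, copy=True)
    median = np.median(values)               # median of the original data
    top_positions = np.argsort(values)[-n:]  # positions of the n largest values
    values[top_positions] = median
    return pd.Series(values, index=column.index, name=column.name)

# Hypothetical toy data standing in for the wine CSV.
df = pd.DataFrame({
    "citric acid": [0.1, 0.2, 0.3, 0.4, 0.5],
    "quality": [5, 6, 5, 7, 6],
    "alcohol": [9.0, 10.0, 11.0, 12.0, 50.0],
    "pH": [3.1, 3.2, 3.3, 9.9, 3.4],
})

# A single set_index call keeps both columns out of the apply step.
df = df.set_index(["citric acid", "quality"])
cleaned = df.apply(replace_top_n_with_median, n=1).reset_index()
print(cleaned)
```

Working with positions rather than index labels matters here, because the index built from citric acid and quality can contain duplicates. Unlike a percentile threshold, this always replaces exactly n values per column, even when there are ties.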
EDIT:
The np.select function expects a list of bool ndarrays, i.e. a 2D data structure, as per the documentation. If you look at condition, it looks like this:
In [35]: condition
Out[35]: array([False, False, False, ..., False, False, False])
.reshape will change the shape of the array. -1 tells NumPy to infer that dimension from the array's length, so the number of rows stays the same, and 1 means that you create a redundant ndarray with only one element in each row.
In [36]: condition.reshape(-1, 1)
Out[36]:
array([[False],
       [False],
       [False],
       ...,
       [False],
       [False],
       [False]])
This is to match the expected signature.
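The effect of reshape is easy to check on a small array. The snippet below is a minimal sketch with a made-up four-element condition, not the one computed from the wine data.

```python
import numpy as np

condition = np.array([False, False, True, False])  # toy condition
print(condition.shape)  # (4,)

# -1 lets NumPy infer the row count from the array's size;
# 1 fixes the number of columns.
as_column = condition.reshape(-1, 1)
print(as_column.shape)  # (4, 1)

# Swapping the arguments gives a single row instead.
as_row = condition.reshape(1, -1)
print(as_row.shape)  # (1, 4)
```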