I am trying to clean a dataset but run into an error where "Red" is not recognised, and I am not sure whether I have written the function correctly. Ideally I want to drop rows based on per-colour length tolerances. I am trying to write a function that takes a colour, an upper tolerance and a lower tolerance, and removes the rows of that colour whose length falls outside those tolerances.
Thanks!
import pandas as pd

df = pd.DataFrame(
    {
        "Colour": [
            "Red",
            "Red",
            "Red",
            "Red",
            "Red",
            "Blue",
            "Blue",
            "Blue",
            "Green",
            "Green",
            "Green",
        ],
        "Length": [14, 15, 16, 20, 15, 15, 18, 17, 15, 19, 18],
    }
)

def tolerance_drop(Colour, Upper, Lower):
    for i in range(0, len(df)):
        if (df.loc[i, "Colour"] == Colour) & (df.loc[i, "Length"] > Upper):
            df.drop([i])
        elif (df.loc[i, "Colour"] == Colour) & (df.loc[i, "Length"] < Lower):
            df.drop([i])
        else:
            break

# should remove 2 red rows giving 9 remaining rows
tolerance_drop("Red", 19.150, 14.5)
print(df)
Output:
It simply prints the same DataFrame as before; no rows are deleted.
CodePudding user response:
Avoid explicit loops when you can use pandas vectorized operations.
Simple filtering:
In [466]: df = df[~((df.Colour == 'Red') & ((df.Length > 19.150) | (df.Length < 14.5)))]

In [467]: df
Out[467]:
   Colour  Length
1     Red      15
2     Red      16
4     Red      15
5    Blue      15
6    Blue      18
7    Blue      17
8   Green      15
9   Green      19
10  Green      18
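Since the question asks for a reusable function, the filter above could be wrapped roughly as follows. This is only a sketch: the signature (passing the DataFrame in and returning a new one) and the use of `Series.between` are my own choices, not something from the question, and it assumes the tolerances are inclusive, i.e. rows with `lower <= Length <= upper` are kept.

```python
import pandas as pd


def tolerance_drop(df, colour, upper, lower):
    """Return a new DataFrame without rows of `colour` whose Length
    lies outside the inclusive range [lower, upper]."""
    # True for rows of the given colour that are out of tolerance.
    out_of_tolerance = (df["Colour"] == colour) & ~df["Length"].between(lower, upper)
    # Keep everything else; the input DataFrame is not modified.
    return df[~out_of_tolerance]


df = pd.DataFrame(
    {
        "Colour": ["Red"] * 5 + ["Blue"] * 3 + ["Green"] * 3,
        "Length": [14, 15, 16, 20, 15, 15, 18, 17, 15, 19, 18],
    }
)

cleaned = tolerance_drop(df, "Red", upper=19.150, lower=14.5)
print(cleaned)
```

Because the function returns a new DataFrame instead of mutating a global, it can be chained per colour, e.g. `tolerance_drop(cleaned, "Blue", 17.5, 14.5)` with whatever tolerances apply.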
CodePudding user response:
As pointed out in the comments, there are better ways of doing this.
But if you are learning and want to know why your function doesn't work, you should try this:
def tolerance_drop(Colour, Upper, Lower):
    for i in range(0, len(df)):
        if df.loc[i, "Colour"] == Colour and (df.loc[i, "Length"] > Upper or df.loc[i, "Length"] < Lower):
            df.drop([i], inplace=True)

tolerance_drop("Red", 19.150, 14.5)
print(df)
In your version, the break statement exits the for-loop as soon as that line is reached, which here happens on the first row that matches neither condition, so you don't want that.
In Python, & is a bitwise operator, which is not the same as the logical and. To combine plain True/False conditions, use and/or.
When you drop a row, the resulting DataFrame is not saved back into the same variable unless you pass the inplace=True argument (or reassign the result yourself).
Output:
   Colour  Length
1     Red      15
2     Red      16
4     Red      15
5    Blue      15
6    Blue      18
7    Blue      17
8   Green      15
9   Green      19
10  Green      18
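To see the point about drop in isolation, here is a small sketch on a made-up three-row DataFrame (the data is hypothetical, chosen just for illustration). It shows that calling drop without inplace=True or reassignment leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy data, not the dataset from the question.
df = pd.DataFrame({"Colour": ["Red", "Red", "Blue"], "Length": [14, 15, 18]})

# drop() returns a new DataFrame; the result is discarded here,
# so df still has all three rows.
df.drop([0])
rows_after_discarded_drop = len(df)

# Reassigning the result (or passing inplace=True) makes the change stick.
df = df.drop([0])
rows_after_reassigned_drop = len(df)

print(rows_after_discarded_drop, rows_after_reassigned_drop)
```

Reassigning (`df = df.drop(...)`) is an alternative to inplace=True; inside a function that works on a global DataFrame, though, a plain reassignment would only rebind a local name, which is one reason to return the result instead.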