Dropping rows that match a specific condition


I have a dataset and I want to drop a few unusable rows. I built a filter for the specific condition under which I want the rows to be dropped:

filter = df.groupby(['Bairro'], group_keys=False, sort=True).size() > 1
print(filter.to_string())

Bairro
01 True
02 False

All the data for which the condition is false is useless. I've tried a few things; none of them worked.

So I'd like the dataframe to keep only the values where the condition is true:

Bairro
01 True

df2 = ((df.groupby(['Bairro']).size()) != 1)

I was even planning to drop the values one by one, but that didn't work either:

df2 = df[~df.isin(['02']).any(axis=1)]

Tried passing the filter as a condition:

df.drop(df[df.groupby(['Bairro'], group_keys=False, sort=True).size() > 1], inplace = True)
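For reference, here is a sketch (with made-up data for the 'Bairro' column) of how a size-based group filter can be turned into a per-row boolean mask, which is what the attempts above are missing:

```python
import pandas as pd

# Made-up data; the real dataset's columns and values are unknown
df = pd.DataFrame({'Bairro': ['01', '01', '02'],
                   'valor': [10, 20, 30]})

# transform('size') broadcasts each group's size back to its rows,
# so the comparison yields one boolean per row instead of one per group
mask = df.groupby('Bairro')['Bairro'].transform('size') > 1
df2 = df[mask]
print(df2['Bairro'].tolist())  # ['01', '01']
```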

CodePudding user response:

It seems like the df.loc method could help you in this instance. In your example:

new_df = df.loc[df['col2'] == True]

Or if you would like to use multiple conditions:

new_df = df.loc[(df['col1'] == True) & (df['col2'] == True)]
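Assuming `col1` and `col2` actually hold booleans (the frame below is made up for illustration), the comparison can even be skipped and the columns used directly as masks:

```python
import pandas as pd

# Made-up frame; col1/col2 stand in for boolean condition columns
df = pd.DataFrame({'col1': [True, True, False],
                   'col2': [True, False, True]})

# A boolean column is already a valid mask, so no == comparison is needed
new_df = df.loc[df['col1'] & df['col2']]
print(len(new_df))  # 1 (only the first row satisfies both conditions)
```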

CodePudding user response:

I think you're over-engineering your solution, so I've opted for a more detailed explanation of the answer.

One way to filter a dataframe is simply to subscript it with a list/array of booleans. If the length of the array matches the length of the dataframe, this returns a dataframe containing only the rows aligned with the True values.

Here is an example:

import pandas as pd
df = pd.DataFrame({
    'numbers': [0,1,2,3,4],
    'letters': ['a','b','c','d','e'],
    'colors': ['red', 'blue', 'yellow', 'green', 'purple']
})
df

Which outputs:

   numbers letters  colors
0        0       a     red
1        1       b    blue
2        2       c  yellow
3        3       d   green
4        4       e  purple

This is what I mean by subscripting with a boolean list (not sure if that's the accepted terminology):

boolean_list = [True, True, False, True, False]
filtered_df = df[boolean_list]
filtered_df

Which outputs:

   numbers letters colors
0        0       a    red
1        1       b   blue
3        3       d  green

We can use a simple comparison to produce this boolean list from a dataframe:

df['numbers']>2

Outputs:

0    False
1    False
2    False
3     True
4     True
Name: numbers, dtype: bool

We can streamline the filtering with this redundant-looking piece of code:

df[df['numbers']>2]

outputs:

   numbers letters  colors
3        3       d   green
4        4       e  purple

While it looks redundant, all we've done there is subscript with a list of booleans. As written, this does not change df at all; for that we would need to do df = df[filter_argument].

For more complicated filtering we can use .apply() to get our list of booleans. Say we only want rows where the letter in 'letters' is present in the color in 'colors':

def letter_in_color(row):
    return row['letters'] in row['colors']
boolean_arr = df.apply(letter_in_color, axis = 1)
print(boolean_arr)

0    False
1     True
2    False
3    False
4     True
dtype: bool

letter_in_color_df = df[boolean_arr]
letter_in_color_df

   numbers letters  colors
1        1       b    blue
4        4       e  purple

I gave this long explanation because, while the concept of filtering a df with a boolean array is quite simple, code that does it often looks weird or redundant, and it isn't clear what is really going on.

I hope you didn't stop reading there, because there is an important and powerful tool you can add to all of the above situations to preclude many errors and unexpected behaviors: `.loc[]`. This is a more explicit and powerful indexer, and in all of the above cases we can gain its benefits with very few changes:

df[boolean_array] becomes df.loc[boolean_array]

For more information about df.loc[] instead of df[] see this answer
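One concrete benefit worth showing: `.loc[]` supports assignment to the selected rows in a single step, whereas chained indexing like `df[mask]['colors'] = ...` may silently modify a copy. A sketch using the same example frame as above:

```python
import pandas as pd

df = pd.DataFrame({
    'numbers': [0, 1, 2, 3, 4],
    'letters': ['a', 'b', 'c', 'd', 'e'],
    'colors': ['red', 'blue', 'yellow', 'green', 'purple']
})
mask = df['numbers'] > 2

# Select rows AND a column in one .loc call, then assign to them;
# this is guaranteed to modify df itself, unlike chained indexing
df.loc[mask, 'colors'] = 'unknown'
print(df['colors'].tolist())
# ['red', 'blue', 'yellow', 'unknown', 'unknown']
```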
