Pandas DataFrame - how to drop the entire category containing null values for specific columns


Let's assume I have the following DataFrame:

book    page  words_per_page  characters
Book 1  1     27              NaN
Book 1  2     NaN             30
Book 1  2     30              25
Book 2  1     30              26
Book 2  2     28              30

How can I delete the books that contain at least one null value in either words_per_page or characters? In the situation above, I want to drop all the entries associated with Book 1.

It is a combination of these two questions, but I just couldn't get it right:

pandas How to drop the whole row if any specific columns contains a specific values?

How to delete rows from a dataframe if specific columns contains null values?

There should be a way to do it directly on the DataFrame. I have lots of data and I need this to run as efficiently as possible, so I'm trying to avoid if statements outside of the DataFrame.

Thanks everyone!

CodePudding user response:

You could find the books that have a NaN in either of the two columns, then exclude those books from your df using the ~ operator:

import pandas as pd
import numpy as np

df = pd.DataFrame({'book': ['Book 1', 'Book 1', 'Book 1', 'Book 2', 'Book 2'],
                   'page': [1, 2, 2, 1, 2],
                   'words_per_page': [27, np.nan, 30, 30, 28],
                   'characters': [np.nan, 30, 25, 26, 30]})

# Books that have a NaN in either of the two columns
books_with_nulls = df.loc[df[['words_per_page', 'characters']].isnull().any(axis=1), 'book'].unique()

# Keep only the rows whose book is not in that list
df.loc[~df['book'].isin(books_with_nulls)]
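
For the sample data this flags Book 1, so only Book 2's rows remain:

print(books_with_nulls)
# ['Book 1']

print(df.loc[~df['book'].isin(books_with_nulls)])
#      book  page  words_per_page  characters
# 3  Book 2     1            30.0        26.0
# 4  Book 2     2            28.0        30.0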

CodePudding user response:

You can use groupby and filter, which removes the entire groups/books that fail the filter. Here the filter is that a book group must be non-null in all of its values for words_per_page and characters:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'book': ['Book 1', 'Book 1', 'Book 1', 'Book 2', 'Book 2'],
    'page': [1, 2, 2, 1, 2],
    'words_per_page': [27, np.nan, 30, 30, 28],
    'characters': [np.nan, 30, 25, 26, 30]
})

# Keep only the books where both columns are entirely non-null
filt_df = df.groupby('book').filter(
    lambda b: b['words_per_page'].notnull().all() and b['characters'].notnull().all()
)

filt_df

Note that groupby.filter is notoriously slow on large datasets, so I'm not sure it will be fast enough for your use case.
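
If it does prove too slow, a vectorized sketch of the same idea (assuming a reasonably recent pandas where groupby supports transform('all')) is to compute a row-level null check and broadcast it back per book with transform, which avoids calling a Python lambda once per group:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'book': ['Book 1', 'Book 1', 'Book 1', 'Book 2', 'Book 2'],
    'page': [1, 2, 2, 1, 2],
    'words_per_page': [27, np.nan, 30, 30, 28],
    'characters': [np.nan, 30, 25, 26, 30]
})

# Row-level check: True where neither column is NaN
row_ok = df[['words_per_page', 'characters']].notna().all(axis=1)

# Broadcast per book: True only if every row of that book passes
keep = row_ok.groupby(df['book']).transform('all')

filt_df = df[keep]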

CodePudding user response:

Identify the book ids that match that criterion and filter the DF using those book ids, e.g.

import pandas as pd

data = [
    [1, 1, 27, None],
    [1, 2, None, 30],
    [1, 2, 30, 25],
    [2, 1, 30, 26],
    [2, 2, 28, 30]
]

columns = [
    'book_id',
    'page_number',
    'words_per_page',
    'character_count'
]

df = pd.DataFrame(data, columns=columns)
# Exclude every book_id that has a null in either column
df = df[~df.book_id.isin(
    df[
        (df.words_per_page.isna()) |
        (df.character_count.isna())
    ].book_id
)].copy()

which leaves only Book 2's rows:

   book_id  page_number  words_per_page  character_count
3        2            1            30.0             26.0
4        2            2            28.0             30.0
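
If this pattern comes up for more column combinations, the isin-based approach above can be wrapped in a small helper (drop_groups_with_nulls is a hypothetical name, just a sketch):

import pandas as pd

def drop_groups_with_nulls(df, group_col, check_cols):
    # Group keys that have a null in any of the checked columns
    bad = df.loc[df[check_cols].isna().any(axis=1), group_col].unique()
    # Keep only the rows whose group key is not among the bad ones
    return df[~df[group_col].isin(bad)].copy()

# e.g. drop_groups_with_nulls(df, 'book_id', ['words_per_page', 'character_count'])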