Pandas DataFrame - how to drop the entire category containing null values for specific columns


Let's assume I have the following DataFrame:

book    page  words_per_page  characters
Book 1  1     27              NaN
Book 1  2     NaN             30
Book 1  2     30              25
Book 2  1     30              26
Book 2  2     28              30

How can I delete the books that contain at least one null value in either words_per_page or characters? In the situation above, I want to drop all the entries associated with Book 1.

It is a combination of these two questions, but I just couldn't get it right:

pandas How to drop the whole row if any specific columns contains a specific values?

How to delete rows from a dataframe if specific columns contains null values?

There should be a way to do it directly on the DataFrame. I have lots of data and I need this to run as efficiently as possible, so I'm trying to avoid if statements outside of the DataFrame.

Thanks everyone!

CodePudding user response:

You could find the books that have a NaN in either of the two columns, then exclude those books from your df using the ~ operator:

import pandas as pd
import numpy as np

df = pd.DataFrame({'book': ['Book 1', 'Book 1', 'Book 1', 'Book 2', 'Book 2'],
                   'page': [1, 2, 2, 1, 2],
                   'words_per_page': [27, np.nan, 30, 30, 28],
                   'characters': [np.nan, 30, 25, 26, 30]})

# Books that have a NaN in either of the two columns
books_with_nulls = df.loc[df[['words_per_page', 'characters']].isnull().any(axis=1), 'book'].unique()

# Keep only the rows whose book is not in that list
df.loc[~df['book'].isin(books_with_nulls)]
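
For the sample data this flags Book 1, so only Book 2's rows remain:

print(books_with_nulls)
# ['Book 1']

print(df.loc[~df['book'].isin(books_with_nulls)])
#      book  page  words_per_page  characters
# 3  Book 2     1            30.0        26.0
# 4  Book 2     2            28.0        30.0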

CodePudding user response:

You can use groupby and filter, which removes the entire groups/books that fail the filter. Here the filter is that a book group must be non-null in all of its values for words_per_page and characters:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'book': ['Book 1', 'Book 1', 'Book 1', 'Book 2', 'Book 2'],
    'page': [1, 2, 2, 1, 2],
    'words_per_page': [27, np.nan, 30, 30, 28],
    'characters': [np.nan, 30, 25, 26, 30]
})

# Keep only the books where both columns are entirely non-null
filt_df = df.groupby('book').filter(
    lambda b: b['words_per_page'].notnull().all() and b['characters'].notnull().all()
)

filt_df

Note that groupby.filter is notoriously slow on large datasets, so I'm not sure it will be fast enough for your use case.
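
If it does prove too slow, a vectorized sketch of the same idea (assuming a reasonably recent pandas where groupby supports transform('all')) is to compute a row-level null check and broadcast it back per book with transform, which avoids calling a Python lambda once per group:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'book': ['Book 1', 'Book 1', 'Book 1', 'Book 2', 'Book 2'],
    'page': [1, 2, 2, 1, 2],
    'words_per_page': [27, np.nan, 30, 30, 28],
    'characters': [np.nan, 30, 25, 26, 30]
})

# Row-level check: True where neither column is NaN
row_ok = df[['words_per_page', 'characters']].notna().all(axis=1)

# Broadcast per book: True only if every row of that book passes
keep = row_ok.groupby(df['book']).transform('all')

filt_df = df[keep]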

CodePudding user response:

Identify the book ids that match that criterion and filter the DF using those book ids, e.g.

import pandas as pd

data = [
    [1, 1, 27, None],
    [1, 2, None, 30],
    [1, 2, 30, 25],
    [2, 1, 30, 26],
    [2, 2, 28, 30]
]

columns = [
    'book_id',
    'page_number',
    'words_per_page',
    'character_count'
]

df = pd.DataFrame(data, columns=columns)
# Exclude every book_id that has a null in either column
df = df[~df.book_id.isin(
    df[
        (df.words_per_page.isna()) |
        (df.character_count.isna())
    ].book_id
)].copy()

which leaves only Book 2's rows:

   book_id  page_number  words_per_page  character_count
3        2            1            30.0             26.0
4        2            2            28.0             30.0
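
If this pattern comes up for more column combinations, the isin-based approach above can be wrapped in a small helper (drop_groups_with_nulls is a hypothetical name, just a sketch):

import pandas as pd

def drop_groups_with_nulls(df, group_col, check_cols):
    # Group keys that have a null in any of the checked columns
    bad = df.loc[df[check_cols].isna().any(axis=1), group_col].unique()
    # Keep only the rows whose group key is not among the bad ones
    return df[~df[group_col].isin(bad)].copy()

# e.g. drop_groups_with_nulls(df, 'book_id', ['words_per_page', 'character_count'])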