Let's assume I have the following DataFrame:

book | page | words_per_page | characters
---|---|---|---
Book 1 | 1 | 27 | NaN
Book 1 | 2 | NaN | 30
Book 1 | 2 | 30 | 25
Book 2 | 1 | 30 | 26
Book 2 | 2 | 28 | 30
How can I delete the books containing at least one null value in either words_per_page or characters? In the above situation, I want to drop all the entries associated with Book 1.
It is a combination of these two questions, but I just couldn't get it right:

- pandas How to drop the whole row if any specific columns contains a specific values?
- How to delete rows from a dataframe if specific columns contains null values?

There should be a way to do it directly on the DataFrame. I have lots of data and I need this to run as efficiently as possible, so I'm trying to avoid `if` statements outside of the DataFrame.
Thanks everyone!
CodePudding user response:
You could find the rows where either of the two columns is NaN, then exclude those books from your df using the `~` operator:
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'book': ['Book 1', 'Book 1', 'Book 1', 'Book 2', 'Book 2'],
                   'page': [1, 2, 2, 1, 2],
                   'words_per_page': [27, np.nan, 30, 30, 28],
                   'characters': [np.nan, 30, 25, 26, 30]})

# Books that have a null in either column
books_with_nulls = df.loc[
    df[['words_per_page', 'characters']].isnull().any(axis=1), 'book'
].unique()

df.loc[~df['book'].isin(books_with_nulls)]
```
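The same idea can be written as one chained expression, with no intermediate variable for the offending books (a sketch; the name `clean` for the result is my own):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'book': ['Book 1', 'Book 1', 'Book 1', 'Book 2', 'Book 2'],
                   'page': [1, 2, 2, 1, 2],
                   'words_per_page': [27, np.nan, 30, 30, 28],
                   'characters': [np.nan, 30, 25, 26, 30]})

# Mask the rows with a null, collect their books, and invert the isin membership test
clean = df[~df['book'].isin(
    df.loc[df[['words_per_page', 'characters']].isna().any(axis=1), 'book']
)]
```

`isin` happily accepts the duplicated Series of bad books, so the `.unique()` step is optional.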
CodePudding user response:
You can `groupby` and `filter`, which removes the entire groups/books that fail the filter. Here the filter is that each book group has to be `notnull` in `all` values for `words_per_page` and `characters`:
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'book': ['Book 1', 'Book 1', 'Book 1', 'Book 2', 'Book 2'],
    'page': [1, 2, 2, 1, 2],
    'words_per_page': [27, np.nan, 30, 30, 28],
    'characters': [np.nan, 30, 25, 26, 30]
})

filt_df = (
    df.groupby('book').filter(
        lambda b: b['words_per_page'].notnull().all() and b['characters'].notnull().all()
    )
)
filt_df
```
`groupby.filter` is notoriously slow on large datasets, though, so I'm not sure it will be fast enough for your use case.
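If the lambda-based `filter` turns out to be too slow, a vectorized alternative is to build a per-row "no nulls" mask and broadcast it over each book with `groupby.transform('all')` (a sketch under the same sample `df`; it produces the same `filt_df` as above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'book': ['Book 1', 'Book 1', 'Book 1', 'Book 2', 'Book 2'],
    'page': [1, 2, 2, 1, 2],
    'words_per_page': [27, np.nan, 30, 30, 28],
    'characters': [np.nan, 30, 25, 26, 30]
})

# True where the row has no nulls in either column
row_ok = df[['words_per_page', 'characters']].notna().all(axis=1)

# Broadcast "are all rows of this book ok?" back onto every row
keep = row_ok.groupby(df['book']).transform('all')

filt_df = df[keep]
```

This stays inside pandas/NumPy the whole way, avoiding the per-group Python callback.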
CodePudding user response:
Identify the book ids that match that criterion and filter the DF using those book ids, e.g.
```python
import pandas as pd

data = [
    [1, 1, 27, None],
    [1, 2, None, 30],
    [1, 2, 30, 25],
    [2, 1, 30, 26],
    [2, 2, 28, 30]
]
columns = [
    'book_id',
    'page_number',
    'words_per_page',
    'character_count'
]
df = pd.DataFrame(data, columns=columns)

df = df[~df.book_id.isin(
    df[
        (df.words_per_page.isna()) |
        (df.character_count.isna())
    ].book_id
)].copy()
```

```
   book_id  page_number  words_per_page  character_count
3        2            1            30.0             26.0
4        2            2            28.0             30.0
```
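Note that the filtered frame keeps the original index labels (3 and 4 above). If you want a fresh 0-based index after dropping the bad books, reset it (a minor follow-up sketch, reusing the same data):

```python
import pandas as pd

data = [
    [1, 1, 27, None],
    [1, 2, None, 30],
    [1, 2, 30, 25],
    [2, 1, 30, 26],
    [2, 2, 28, 30]
]
columns = ['book_id', 'page_number', 'words_per_page', 'character_count']
df = pd.DataFrame(data, columns=columns)

df = df[~df.book_id.isin(
    df[df.words_per_page.isna() | df.character_count.isna()].book_id
)].copy()

# Drop the leftover index labels for a clean 0-based index
df = df.reset_index(drop=True)
```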