Home > Enterprise >  Filtering dataframe by grouping on a column and checking if condition holds true for each member of
Filtering dataframe by grouping on a column and checking if condition holds true for each member of

Time:12-14

I have a dataframe called df which contains two columns, one called order_id and one called article_name.

The dataframe has some faulty entries, where each and every article_name that belongs to one order is empty.

What I'm trying to do is filter out all rows which have an order_id for which every row with the same order_id has an empty string in place of the article_name.

I tried these two lines of code, neither of which seemed to work.

df\[df.groupby('order_id')\['article_name'\].transform('all') == ''\]

df\[df.groupby('order_id')\['article_name'\].all() == ''\]

I know for a fact that there are orders where every article_name is empty, however these two claim that there are none. What am I doing wrong?

CodePudding user response:

Sample data:

from random import randint as ri
from faker import Faker
fake = Faker()

df = pd.DataFrame({
    'order_id': [ri(1,20) for _ in range(0,100)],
    'article_name':[fake.name() if ri(0,1) == 1 else '' for _ in range(0,100)]
})
df
Output:
    order_id    article_name
0   8           Morgan Sullivan
1   12          Brian Hogan
2   11          Tony Carrillo
... ...         ...
97  4   
98  18  
99  10          Matthew Smith

100 rows × 2 columns

Filter only empty ones:

df[df['article_name'] == '']['order_id']
Output:
6      1
9     11
13     2
      ..
93    12
97     4
98    18

Name: order_id, Length: 49, dtype: int64

Count how many empty articles are within one particular order_id:

df[df['article_name'] == ''].groupby('order_id').count()
Output:
            article_name
order_id    
1           4
2           2
3           1
...         ...
18          3
19          6
20          3

20 rows × 1 columns

Bear in mind this will only work if empty lines are ''. It may also be np.nan, than df['article_name'].isna().

whether the article_name field is empty for all rows which have that order_id

df['is_empty'] = df.article_name.isna()
gr_c = df.groupby('order_id').count()
gr_c[gr_c.article_name == gr_c.is_empty]
Output:
            article_name    is_empty
order_id        
19          4               4

Means only order_id 19 is all empty.

  • Related