I have a dataframe called df which contains two columns, one called order_id and one called article_name.
The dataframe has some faulty entries, where each and every article_name that belongs to one order is empty.
What I'm trying to do is filter out all rows which have an order_id for which every row with the same order_id has an empty string in place of the article_name.
I tried these two lines of code, neither of which seemed to work.
df\[df.groupby('order_id')\['article_name'\].transform('all') == ''\]
df\[df.groupby('order_id')\['article_name'\].all() == ''\]
I know for a fact that there are orders where every article_name is empty, however these two claim that there are none. What am I doing wrong?
CodePudding user response:
Sample data:
from random import randint as ri
from faker import Faker
fake = Faker()
df = pd.DataFrame({
'order_id': [ri(1,20) for _ in range(0,100)],
'article_name':[fake.name() if ri(0,1) == 1 else '' for _ in range(0,100)]
})
df
Output:
order_id article_name
0 8 Morgan Sullivan
1 12 Brian Hogan
2 11 Tony Carrillo
... ... ...
97 4
98 18
99 10 Matthew Smith
100 rows × 2 columns
Filter only empty ones:
df[df['article_name'] == '']['order_id']
Output:
6 1
9 11
13 2
..
93 12
97 4
98 18
Name: order_id, Length: 49, dtype: int64
Count how many empty articles are within one particular order_id:
df[df['article_name'] == ''].groupby('order_id').count()
Output:
article_name
order_id
1 4
2 2
3 1
... ...
18 3
19 6
20 3
20 rows × 1 columns
Bear in mind this will only work if empty lines are ''
. It may also be np.nan
, than df['article_name'].isna()
.
whether the article_name field is empty for all rows which have that order_id
df['is_empty'] = df.article_name.isna()
gr_c = df.groupby('order_id').count()
gr_c[gr_c.article_name == gr_c.is_empty]
Output:
article_name is_empty
order_id
19 4 4
Means only order_id
19 is all empty.