Remove items that have only 1 type in a DataFrame without using groupby

Time:06-30

item ID  type
1        A
1        B
2        A
3        B
4        B
4        B
5        B
5        A

I'm trying to remove items that have only 1 type. In this data, the items to keep are [1, 5].

1 has both A & B, so it's good.
2 only has A, so it's out.
3 only has B, so it's out.
4 only has B, and even though it appears twice, it's out.
5 has both A & B, so regardless of order, it's good.

So I tried this code:

for i in data.groupby("item ID"):  # pandas
    # i is a (key, group) tuple; count unique types in the group
    if len(i[1]["type"].unique()) == 1:
        data.drop(i[1].index, inplace=True)

It works, but it's too slow.

Can someone please tell me a faster way?

CodePudding user response:

Use GroupBy.transform with DataFrameGroupBy.nunique to get the number of unique values per group, test whether it is not 1 with Series.ne, and filter the rows by boolean indexing:

df = data[data.groupby("item ID")['type'].transform('nunique').ne(1)]
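As a quick check, here is a minimal sketch on the sample data from the question, showing the intermediate values that `transform('nunique')` broadcasts back to every row:

```python
import pandas as pd

# Sample data from the question
data = pd.DataFrame({'item ID': [1, 1, 2, 3, 4, 4, 5, 5],
                     'type':    ['A', 'B', 'A', 'B', 'B', 'B', 'B', 'A']})

# Number of unique types per item, repeated on every row of that item
counts = data.groupby('item ID')['type'].transform('nunique')
# counts -> 2, 2, 1, 1, 1, 1, 2, 2

# Keep only rows whose item has more than one type
df = data[counts.ne(1)]
```

Because `transform` returns a Series aligned with the original index, the result can be used directly as a boolean mask after `.ne(1)`.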

Without groupby, if there are always 1 or 2 rows per item ID, a possible solution uses DataFrame.duplicated:

df = data[~data.duplicated(keep=False) & data.duplicated('item ID', keep=False)]
print (df)
   item ID type
0        1    A
1        1    B
6        5    B
7        5    A
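A sketch of how the two `duplicated` masks combine on the sample data (this relies on the stated assumption of at most 2 rows per item ID):

```python
import pandas as pd

data = pd.DataFrame({'item ID': [1, 1, 2, 3, 4, 4, 5, 5],
                     'type':    ['A', 'B', 'A', 'B', 'B', 'B', 'B', 'A']})

# Rows that are NOT exact full-row duplicates - drops item 4's two identical (4, B) rows
not_full_dup = ~data.duplicated(keep=False)

# Rows whose item ID occurs more than once - drops the singletons 2 and 3
multi_id = data.duplicated('item ID', keep=False)

# With at most 2 rows per item, what survives both masks are items
# that appear twice with two different types
df = data[not_full_dup & multi_id]
```

Note this equivalence breaks if an item can have 3+ rows, e.g. two `(x, A)` rows plus one `(x, B)` row: the two identical rows would be dropped while the lone `(x, B)` row is kept.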

To get the unique item IDs, use:

s = data.groupby("item ID")['type'].nunique()
L = s.index[s.ne(1)].tolist()
print (L)
[1, 5]

mask = ~data.duplicated(keep=False) & data.duplicated('item ID', keep=False)
L = data.loc[mask, 'item ID'].unique().tolist()
print (L)
[1, 5]

Performance:

np.random.seed(20)

df = pd.DataFrame({'item ID':np.random.randint(3000, size=5000),
                   'type':np.random.choice(['A','B'], size=5000)})

data = df.sort_values('item ID').groupby('item ID').head(2)
print (data)
#[3908 rows x 2 columns]


In [51]: %%timeit
    ...: for i in data.groupby("item ID"):  # pandas
    ...:     _ = len(i[1].type.unique())
    ...:     if _ == 1:
    ...:         data.drop(i[1].index, inplace=True)
    ...:         
119 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#BeRT2me solution
In [52]: %timeit data.groupby('item ID').filter(lambda x: x['type'].nunique() > 1)
709 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [53]: %timeit data[data.groupby("item ID")['type'].transform('nunique').ne(1)]
1.75 ms ± 14.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [54]: %timeit data[~data.duplicated(keep=False)&data.duplicated('item ID', keep=False)]
2.1 ms ± 192 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
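For reference, a self-contained sketch of the benchmark setup that also verifies all three approaches return the same rows (timings are omitted here; they will vary by machine):

```python
import numpy as np
import pandas as pd

np.random.seed(20)
df = pd.DataFrame({'item ID': np.random.randint(3000, size=5000),
                   'type': np.random.choice(['A', 'B'], size=5000)})
# Keep at most 2 rows per item ID, as in the benchmark above
data = df.sort_values('item ID').groupby('item ID').head(2)

# Vectorized solutions
out1 = data[data.groupby('item ID')['type'].transform('nunique').ne(1)]
out2 = data[~data.duplicated(keep=False) & data.duplicated('item ID', keep=False)]

# Original loop, run on a copy so `data` stays intact
loop = data.copy()
for key, group in loop.groupby('item ID'):
    if group['type'].nunique() == 1:
        loop = loop.drop(group.index)
```

All three outputs are identical DataFrames, which is why the timings above are an apples-to-apples comparison.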