item ID | type |
---|---|
1 | A |
1 | B |
2 | A |
3 | B |
4 | B |
4 | B |
5 | B |
5 | A |
I'm trying to find out which item has only 1 type. In this data, answer is [1,5].
1 has both A & B. It's good.
2 only has A, so fault.
3 only has B, so fault.
4 only has B, even it appeared twice, fault.
5 has both A & B. So whatever its order, It's good.
so I try this code
for i in data.groupby("item ID"): # pandas
_ = len(i[1].type.unique())
if _ == 1:
data.drop(i[1].index, inplpace=True)
it works well, But too slow.
please someone tell me faster way.
CodePudding user response:
Use GroupBy.transform
with DataFrameGroupBy.nunique
for get number of unique values per groups and test if not 1 value by Series.ne
, filter rows by boolean indexing
:
df = data[data.groupby("item ID")['type'].transform('nunique').ne(1)]
Without groupby if there are always 2 or 1 values per group by column item ID
is possible solution with DataFrame.duplicated
:
df = data[~data.duplicated(keep=False) & data.duplicated('item ID', keep=False)]
print (df)
item ID type
0 1 A
1 1 B
6 5 B
7 5 A
For get unique item ID
use:
s = data.groupby("item ID")['type'].nunique()
L = s.index[s.ne(1)].tolist()
print (L)
[1, 5]
mask = ~data.duplicated(keep=False) & data.duplicated('item ID', keep=False)
L = data.loc[mask, 'item ID'].unique().tolist()
print (L)
[1, 5]
Performance:
np.random.seed(20)
df = pd.DataFrame({'item ID':np.random.randint(3000, size=5000),
'type':np.random.choice(['A','B'], size=5000)})
#[3908 rows x 2 columns]
data = df.sort_values('item ID').groupby('item ID').head(2)
print (data)
In [51]: %%timeit
...: for i in data.groupby("item ID"): # pandas
...: _ = len(i[1].type.unique())
...: if _ == 1:
...: data.drop(i[1].index, inplace=True)
...:
119 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#BeRT2me solution
In [52]: %timeit data.groupby('item ID').filter(lambda x: x['type'].nunique() > 1)
709 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [53]: %timeit data[data.groupby("item ID")['type'].transform('nunique').ne(1)]
1.75 ms ± 14.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [54]: %timeit data[~data.duplicated(keep=False)&data.duplicated('item ID', keep=False)]
2.1 ms ± 192 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)