I have a dataframe df1:
val0 | dir0 | val1 | dir1 | val2 | dir2 | val3 | dir3 |
---|---|---|---|---|---|---|---|
a0 | up | a1 | | a2 | up | a3 | up |
b0 | up | b1 | down | b2 | up | b3 | up |
c0 | | c1 | down | c2 | down | c3 | down |
d0 | up | d1 | | d2 | | d3 | up |
e0 | up | e1 | down | e2 | down | e3 | up |
f0 | | f1 | up | f2 | | f3 | |
I want to select rows that have only one direction (or null), i.e. [dir0, dir1, dir2, dir3] must contain only 'up' or only 'down', not both (empty values are allowed).
df_out:
val0 | dir0 | val1 | dir1 | val2 | dir2 | val3 | dir3 |
---|---|---|---|---|---|---|---|
a0 | up | a1 | | a2 | up | a3 | up |
c0 | | c1 | down | c2 | down | c3 | down |
d0 | up | d1 | | d2 | | d3 | up |
f0 | | f1 | up | f2 | | f3 | |
Is there a good (and scalable) way of doing this, or do I have to hard-code the comparison like:
df_out = df1[( ( (df1['dir0'] == 'up') | (df1['dir0'].isnull()) ) & ......) |
( ( (df1['dir0'] == 'down') | (df1['dir0'].isnull()) ) & ......)]
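For reference, here is a minimal construction of the sample frame, assuming the blank cells are NaN (this matches the printed outputs further down):
import numpy as np
import pandas as pd

# Sample data matching the table above; blank cells assumed to be NaN
df1 = pd.DataFrame({
    'val0': ['a0', 'b0', 'c0', 'd0', 'e0', 'f0'],
    'dir0': ['up', 'up', np.nan, 'up', 'up', np.nan],
    'val1': ['a1', 'b1', 'c1', 'd1', 'e1', 'f1'],
    'dir1': [np.nan, 'down', 'down', np.nan, 'down', 'up'],
    'val2': ['a2', 'b2', 'c2', 'd2', 'e2', 'f2'],
    'dir2': ['up', 'up', 'down', np.nan, 'down', np.nan],
    'val3': ['a3', 'b3', 'c3', 'd3', 'e3', 'f3'],
    'dir3': ['up', 'up', 'down', 'up', 'up', None],
})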
CodePudding user response:
You could convert each row's directions to a set and check that its length is ≤ 1:
# NaN is a float, so isinstance(i, str) silently drops missing values
df[df.filter(like='dir').apply(lambda x: len(set(i for i in x if isinstance(i, str))), axis=1).le(1)]
This is quite a fast approach compared to `nunique`; interestingly, this is one of the rare cases where `apply` seems faster.
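The `nunique` approach timed below isn't shown; presumably it was something along these lines (an assumption; note that `nunique` skips NaN by default):
# presumed nunique variant: at most one distinct non-NaN direction per row
df[df.filter(like='dir').nunique(axis=1).le(1)]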
Timing on 60k rows (`df = pd.concat([df]*10000, ignore_index=True)`):
# multiple masks approach
28.4 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# set approach
274 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# nunique approach
2.86 s ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
output:
val0 dir0 val1 dir1 val2 dir2 val3 dir3
0 a0 up a1 NaN a2 up a3 up
2 c0 NaN c1 down c2 down c3 down
3 d0 up d1 NaN d2 NaN d3 up
5 f0 NaN f1 up f2 NaN f3 None
CodePudding user response:
First select only the `dir` columns with `DataFrame.filter`, then test for `up`, missing values, or `down`, and require that all values per row match using `DataFrame.all`:
df1 = df.filter(like='dir')   # keep only the dir columns
m1 = df1.eq('up')             # cell is 'up'
m2 = df1.isna()               # cell is missing
m3 = df1.eq('down')           # cell is 'down'
# keep rows where every cell is 'up' or missing, or every cell is 'down' or missing
df = df[(m1 | m2).all(axis=1) | (m3 | m2).all(axis=1)]
print(df)
val0 dir0 val1 dir1 val2 dir2 val3 dir3
0 a0 up a1 NaN a2 up a3 up
2 c0 NaN c1 down c2 down c3 down
3 d0 up d1 NaN d2 NaN d3 up
5 f0 NaN f1 up f2 NaN f3 None
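As a side note, the same mask idea can be sketched for an arbitrary list of allowed directions; the `directions` list here is hypothetical and not part of the original answer:
import numpy as np

directions = ['up', 'down']  # hypothetical: any set of allowed values
dirs = df.filter(like='dir')
na = dirs.isna()
# a row qualifies if, for some single direction d, every cell is d or missing
mask = np.logical_or.reduce([(dirs.eq(d) | na).all(axis=1) for d in directions])
df_out = df[mask]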
Performance for 60k rows (the first timing is a `fillna` variant of the same idea):
df = pd.concat([df]*10000, ignore_index=True)
In [176]: %%timeit
...: df1 = df.filter(like='dir')
...: m1 = df1.fillna('up').eq('up').all(axis=1)
...: m2 = df1.fillna('down').eq('down').all(axis=1)
...:
...: df[m1 | m2]
...:
59.9 ms ± 531 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [177]: %%timeit
...: df1 = df.filter(like='dir')
...: m1 = df1.eq('up')
...: m2 = df1.isna()
...: m3 = df1.eq('down')
...:
...: df[(m1 | m2).all(axis=1) | (m3 | m2).all(axis=1)]
...:
...:
44.4 ms ± 2.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [178]: %%timeit
...: df[df.filter(like='dir').apply(lambda x: len(set(i for i in x if isinstance(i, str))), axis=1).le(1)]
...:
529 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)