Select rows if multiple column values are same (or null)

Time:10-01

I have a dataframe df1:

val0 dir0 val1 dir1 val2 dir2 val3 dir3
a0   up   a1        a2   up   a3   up
b0   up   b1   down b2   up   b3   up
c0        c1   down c2   down c3   down
d0   up   d1        d2        d3   up
e0   up   e1   down e2   down e3   up
f0        f1   up   f2        f3

I want to select rows that have only one direction (or missing values). So [dir0, dir1, dir2, dir3] must contain only 'up' or only 'down', not both (missing values are allowed).

df_out:

val0 dir0 val1 dir1 val2 dir2 val3 dir3
a0   up   a1        a2   up   a3   up
c0        c1   down c2   down c3   down
d0   up   d1        d2        d3   up
f0        f1   up   f2
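For reference, the example frame can be reconstructed like this (a sketch; blank cells are assumed to be missing values):

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of df1 from the tables above;
# blank cells become NaN.
df1 = pd.DataFrame({
    'val0': ['a0', 'b0', 'c0', 'd0', 'e0', 'f0'],
    'dir0': ['up', 'up', np.nan, 'up', 'up', np.nan],
    'val1': ['a1', 'b1', 'c1', 'd1', 'e1', 'f1'],
    'dir1': [np.nan, 'down', 'down', np.nan, 'down', 'up'],
    'val2': ['a2', 'b2', 'c2', 'd2', 'e2', 'f2'],
    'dir2': ['up', 'up', 'down', np.nan, 'down', np.nan],
    'val3': ['a3', 'b3', 'c3', 'd3', 'e3', 'f3'],
    'dir3': ['up', 'up', 'down', 'up', 'up', np.nan],
})
```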

Is there a good (and scalable) way of doing this, or do I have to hard-code the comparison like:

df_out = df1[( ( (df1['dir0'] == 'up')   | (df1['dir0'].isnull()) ) & ......) | 
             ( ( (df1['dir0'] == 'down') | (df1['dir0'].isnull()) ) & ......)]

CodePudding user response:

You could convert each row's direction values to a set and check that its length is ≤ 1:

df[df.filter(like='dir')
     .apply(lambda x: len(set(i for i in x if isinstance(i, str))), axis=1)
     .le(1)]

This approach is quite fast compared to nunique; interestingly, it is one of the rare cases where apply seems faster.

Timing on 60k rows (df = pd.concat([df]*10000, ignore_index=True))


# multiple masks approach
28.4 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# set approach
274 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# nunique approach
2.86 s ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

output:

  val0 dir0 val1  dir1 val2  dir2 val3  dir3
0   a0   up   a1   NaN   a2    up   a3    up
2   c0  NaN   c1  down   c2  down   c3  down
3   d0   up   d1   NaN   d2   NaN   d3    up
5   f0  NaN   f1    up   f2   NaN   f3  None
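For completeness, the nunique variant being timed above was presumably along these lines (an assumption, since the original code is not shown):

```python
import numpy as np
import pandas as pd

# Small frame mirroring the question's dir columns (blanks assumed NaN).
df = pd.DataFrame({
    'dir0': ['up', 'up', np.nan, 'up', 'up', np.nan],
    'dir1': [np.nan, 'down', 'down', np.nan, 'down', 'up'],
    'dir2': ['up', 'up', 'down', np.nan, 'down', np.nan],
    'dir3': ['up', 'up', 'down', 'up', 'up', np.nan],
})

# DataFrame.nunique(axis=1) counts distinct non-null values per row,
# so rows with at most one direction (or only missing values) pass.
df_out = df[df.filter(like='dir').nunique(axis=1).le(1)]
```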

CodePudding user response:

First select only the dir columns with DataFrame.filter, then build boolean masks for 'up', missing values, and 'down', and test whether all values per row are True with DataFrame.all:

df1 = df.filter(like='dir')   # only the dir* columns
m1 = df1.eq('up')
m2 = df1.isna()
m3 = df1.eq('down')

# keep rows that are entirely up/missing, or entirely down/missing
df = df[(m1 | m2).all(axis=1) | (m3 | m2).all(axis=1)]
print(df)
  val0 dir0 val1  dir1 val2  dir2 val3  dir3
0   a0   up   a1   NaN   a2    up   a3    up
2   c0  NaN   c1  down   c2  down   c3  down
3   d0   up   d1   NaN   d2   NaN   d3    up
5   f0  NaN   f1    up   f2   NaN   f3  None
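If the set of direction values is not known in advance, the masks above cannot be hard-coded. One vectorized generalization (a sketch of my own, not from the answer) counts distinct non-null values per row via stack, which drops NaN by default:

```python
import numpy as np
import pandas as pd

# Same example dir columns as in the question (blanks assumed NaN).
df = pd.DataFrame({
    'dir0': ['up', 'up', np.nan, 'up', 'up', np.nan],
    'dir1': [np.nan, 'down', 'down', np.nan, 'down', 'up'],
    'dir2': ['up', 'up', 'down', np.nan, 'down', np.nan],
    'dir3': ['up', 'up', 'down', 'up', 'up', np.nan],
})

# stack() drops missing values, so grouping by the original row index
# and counting unique values needs no hard-coded list of directions.
counts = df.filter(like='dir').stack().groupby(level=0).nunique()

# Rows whose dir columns are all missing vanish from the stacked view,
# so reindex with fill_value=0 to keep them.
mask = counts.reindex(df.index, fill_value=0).le(1)
df_out = df[mask]
```

This avoids both apply and per-value masks, at the cost of building the stacked intermediate.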

Performance for 60k rows:

df = pd.concat([df]*10000, ignore_index=True)

In [176]: %%timeit
     ...: df1 = df.filter(like='dir')
     ...: m1 = df1.fillna('up').eq('up').all(axis=1)
     ...: m2 = df1.fillna('down').eq('down').all(axis=1)
     ...: 
     ...: df[m1 | m2]
     ...: 
59.9 ms ± 531 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [177]: %%timeit
     ...: df1 = df.filter(like='dir')
     ...: m1 = df1.eq('up')
     ...: m2 = df1.isna()
     ...: m3 = df1.eq('down')
     ...: 
     ...: df[(m1 | m2).all(axis=1) | (m3 | m2).all(axis=1)]
     ...: 
     ...: 
44.4 ms ± 2.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [178]: %%timeit
     ...: df[df.filter(like='dir').apply(lambda x: len(set(i for i in x if isinstance(i, str))), axis=1).le(1)]
     ...: 
529 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)