Compare all columns in a DataFrame with each other


Let's say that this is my dataframe:

id    c1    c3    c5    c7
1     10    11    12    13
2     10    14    11    13
3     15    14    12    10

I want to compare each column with the next one in each row.

If c1 > c3 and c3 > c5 and so on, the result for that row is 1; if c1 < c3 and c3 < c5 and so on, the result for that row is -1; otherwise, the result should be 0.

This can be done with a simple loop like this:

for i in df.index:
    if df.loc[i, 'c1'] < df.loc[i, 'c3'] and df.loc[i, 'c3'] < df.loc[i, 'c5'] and df.loc[i, 'c5'] < df.loc[i, 'c7']:
        df.loc[i, 'result'] = -1
    elif df.loc[i, 'c1'] > df.loc[i, 'c3'] and df.loc[i, 'c3'] > df.loc[i, 'c5'] and df.loc[i, 'c5'] > df.loc[i, 'c7']:
        df.loc[i, 'result'] = 1
    else:
        df.loc[i, 'result'] = 0

and the expected result should be a dataframe like this:

id    result
1     1
2     0
3     -1

I feel like there must be a faster/cleaner way to do this, using pandas functions.

Also, I'm not sure if it helps, but as you can see, the column names follow a pattern.

CodePudding user response:

import numpy as np
import pandas as pd

# Your df here:
df = pd.read_clipboard().set_index("id")

out = np.select([
    df.agg(lambda x: x.is_monotonic_increasing, axis=1),
    df.agg(lambda x: x.is_monotonic_decreasing, axis=1)
], [1, -1])

out:

array([ 1,  0, -1])
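One caveat: is_monotonic_increasing and is_monotonic_decreasing are non-strict, so a row with ties (e.g. 10, 10, 12, 13) would still count as increasing. If the strict comparisons in the question matter, a diff-based check is one way to sketch it (the df is reconstructed explicitly here rather than read from the clipboard):

```python
import numpy as np
import pandas as pd

# The example data from the question.
df = pd.DataFrame(
    {"c1": [10, 10, 15], "c3": [11, 14, 14], "c5": [12, 11, 12], "c7": [13, 13, 10]},
    index=pd.Index([1, 2, 3], name="id"),
)

# Differences between consecutive columns; the first column is all NaN, drop it.
d = df.diff(axis=1).iloc[:, 1:]

out = np.select(
    [d.gt(0).all(axis=1),   # strictly increasing rows
     d.lt(0).all(axis=1)],  # strictly decreasing rows
    [1, -1],
    default=0,
)
print(out)  # -> [ 1  0 -1]
```

This keeps the same np.select structure as above but rejects ties, since a zero difference fails both the gt(0) and lt(0) checks.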

CodePudding user response:

You can use df.diff to see increments/decrements:

import numpy as np

diff = df.diff(axis=1)

# checks if all consecutive differences are positive
c1 = diff.fillna(1).gt(0).all(1)

# checks if all consecutive differences are negative
c2 = diff.fillna(-1).lt(0).all(1)

# assign
df['result'] = np.select([c1, c2], [1, -1], 0)

print(df)

    c1  c3  c5  c7  result
id                        
1   10  11  12  13       1
2   10  14  11  13       0
3   15  14  12  10      -1

CodePudding user response:

A simple one-line solution:

df['result'] = df.apply(lambda x: 1 if x.c1 > x.c3 > x.c5 > x.c7 else (-1 if x.c1 < x.c3 < x.c5 < x.c7 else 0), axis=1)

Output:

df.result
0   -1
1    0
2    1
Name: result, dtype: int64

CodePudding user response:

tl;dr:

df['result'] = ((df.c1 > df.c3) & (df.c3 > df.c5) & (df.c5 > df.c7)) * 1 + ((df.c1 < df.c3) & (df.c3 < df.c5) & (df.c5 < df.c7)) * -1

Never iterate over rows. That is the first rule of pandas.

Python is a very slow language. The reason it is nonetheless the most popular language of the moment, and that many very fast applications are written in Python, is that Python coders do whatever it takes to keep the time-consuming code out of pure Python.

And the way to do that is to use libraries such as pandas and numpy (Python libraries whose internals are not written in Python) for the heavy computations. When processing a lot of data, the time-consuming code is whatever you apply to every row of a big dataframe, so you use a library that does the iteration over the rows, and the computation on each of them, for you.

If you use pandas but still iterate over the rows yourself, you miss the point. In that case, you would be better off simply using a Python list or dictionary.

Note that the same goes for apply: it is the next thing not to do. It is still better than coding the for loop yourself, because at least the loop itself runs inside pandas code, but what you do on each row remains Python code.

So, in your case, you must find a way to do your computation on whole columns at a time (note that I assume you have far more rows than columns, unlike in your example; that is usually the case with dataframes).

For example, df.c1 > df.c3 computes an array of booleans.

You can combine such comparisons to get what you want:

df['result'] = ((df.c1 > df.c3) & (df.c3 > df.c5) & (df.c5 > df.c7)) * 1 + ((df.c1 < df.c3) & (df.c3 < df.c5) & (df.c5 < df.c7)) * -1

Some timing considerations

For only 3 rows, the overhead of batch computation makes the for-loop-based solution (yours) still the fastest.

But the difference shows for larger numbers of rows. See the table below.

Method                         3 rows   1000 rows   10k rows   100k rows
Yours (for loop)               1.3 ms   394 ms      3808 ms    38400 ms
Chrysophylax's (agg)           2.4 ms   140 ms      1290 ms    13800 ms
Nehal's (apply)                1.0 ms   71 ms       695 ms     6930 ms
Anky's                         2.6 ms   2.8 ms      4.2 ms     18 ms
Mine (df['c1']>df['c3']...)    2.0 ms   2.0 ms      2.2 ms     7.5 ms
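Timings like these can be reproduced along the following lines (a sketch; the row count is an arbitrary choice here, and the exact numbers depend on the machine):

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data: 20,000 random rows with the question's columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(20_000, 4)),
                  columns=["c1", "c3", "c5", "c7"])

# Row-wise apply: the per-row work is still pure Python.
apply_time = timeit.timeit(
    lambda: df.apply(
        lambda x: 1 if x.c1 > x.c3 > x.c5 > x.c7
        else (-1 if x.c1 < x.c3 < x.c5 < x.c7 else 0),
        axis=1),
    number=1)

# Whole-column boolean arithmetic: the loop happens inside numpy/pandas.
vec_time = timeit.timeit(
    lambda: ((df.c1 > df.c3) & (df.c3 > df.c5) & (df.c5 > df.c7)) * 1
    + ((df.c1 < df.c3) & (df.c3 < df.c5) & (df.c5 < df.c7)) * -1,
    number=1)

print(f"apply: {apply_time * 1000:.1f} ms, vectorized: {vec_time * 1000:.1f} ms")
```

On any recent machine the vectorized version should come out orders of magnitude faster, consistent with the table.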

Note that 394/1.3 ≈ 303, almost the 333× ratio of row counts. Your solution is O(n), and it already shows.

Chrysophylax's (and the other lambda-based solutions) is already faster, because it spares the Python for loop itself, but it does not spare the pure-Python execution of what was done inside the loop. Again, the worst crime in pandas is iterating rows with a for loop; the second worst crime is using lambdas. For n big enough you get a factor-3 gain from avoiding the loop.

My solution, and the other non-for, non-lambda solutions, are also O(n). There is no free lunch: any method, pandas-based or not, that computes something on every row has to be at least O(n). But even at n = 10000 it doesn't show, because the computation is so fast that the fixed overhead still dominates and the computation itself is still negligible. At n = 100000, the time is still only 7.5 ms (it starts, at last, to increase, but still not proportionally).

So you see, the gain is at least ×5000 for data big enough.

Edit: I wrote this before anky's solution was posted. Note that their solution also avoids both for loops and lambdas, so it shows the same kind of speed-up. For n = 100000, their code takes 18 ms/run on my computer. That is slower than my 7.5 ms/run, sure, but in the same order of magnitude. Plus, their solution has an important advantage over mine: it works unchanged for any number of columns (with mine, you would have to add some & (df.c7 > df.c9) to the code to handle an extra c9 column).
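For completeness, the chained-comparison idea can itself be generalized to any number of columns by diffing consecutive columns in numpy; a sketch, using the question's data and the same orientation as the tl;dr line (decreasing rows get 1, increasing rows get -1):

```python
import numpy as np
import pandas as pd

# The example data from the question.
df = pd.DataFrame(
    {"c1": [10, 10, 15], "c3": [11, 14, 14], "c5": [12, 11, 12], "c7": [13, 13, 10]},
    index=pd.Index([1, 2, 3], name="id"),
)

# Differences between consecutive columns, as a plain ndarray
# (shape: n_rows x (n_cols - 1)); works for any number of columns.
d = np.diff(df.to_numpy(), axis=1)

# Strictly decreasing rows -> 1, strictly increasing rows -> -1, otherwise 0.
df["result"] = (d < 0).all(axis=1) * 1 + (d > 0).all(axis=1) * -1
print(df["result"].tolist())  # -> [-1, 0, 1]
```

This stays fully vectorized (no for loop, no lambda) while removing the need to spell out every column pair by hand.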
