How to add a new column to dataframe based on conditions on another column

Time:01-08

I have the following example dataframe:

import pandas as pd

d = {'col1': [4, 2, 8, 4, 3, 7, 6, 9, 3, 5]}
df = pd.DataFrame(data=d)
df


    col1
0   4
1   2
2   8
3   4
4   3
5   7
6   6
7   9
8   3
9   5

I need to add col2 to this dataframe. The values of this new column are set by comparing each col1 value with the col1 values in the rows that follow it, as described below. Each row of col2 is set as follows:

df.loc[0, "col2"] will say how many of df.loc[1, "col1"], df.loc[2, "col1"] and df.loc[3, "col1"] are bigger than df.loc[0, "col1"].

df.loc[1, "col2"] will say how many of df.loc[2, "col1"], df.loc[3, "col1"] and df.loc[4, "col1"] are bigger than df.loc[1, "col1"].

df.loc[2, "col2"] will say how many of df.loc[3, "col1"], df.loc[4, "col1"] and df.loc[5, "col1"] are bigger than df.loc[2, "col1"].

And so on...

If fewer than 3 rows remain after a given row, its col2 value will be set to -1.

The end result will look like the following:

    col1    col2
0   4       1
1   2       3
2   8       0
3   4       2
4   3       3
5   7       1
6   6       1
7   9       -1
8   3       -1
9   5       -1

I need a function that will take a dataframe as input and will return the dataframe by adding the new column as described above.

In the example above, the next 3 rows are considered. But this number needs to be configurable and should be an input to the function that does the work.

Speed is important here, so I would like to avoid for loops.

How can this be done in the most efficient way in Python?

CodePudding user response:

You need a rolling window over the reversed column to compare each value to the ones that follow it:

N = 3
df['col2'] = (df.loc[::-1, 'col1']
                .rolling(N + 1)   # window = current row plus the next N rows
                .apply(lambda s: s.iloc[:-1].gt(s.iloc[-1]).sum())
                .fillna(-1)       # rows with fewer than N following rows
                .astype(int)
             )
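Since the question asks for a function with a configurable window, here is a minimal sketch of the rolling approach wrapped in one. The function name `count_next_greater_rolling` and the parameters `n`, `src` and `dst` are made up for illustration. It also passes `raw=True`, so each window reaches the lambda as a NumPy array instead of a Series, which usually reduces per-window overhead; I have not benchmarked it on your data.

```python
import pandas as pd


def count_next_greater_rolling(df, n=3, src="col1", dst="col2"):
    """Hypothetical helper: for each row, count how many of the next n
    values in `src` are greater than the row's own value (-1 if fewer
    than n rows follow)."""
    # Reverse the column so that a trailing window of n + 1 values holds
    # the next n rows followed by the current row (as its last element).
    reversed_col = df[src].iloc[::-1]
    counts = (reversed_col
              .rolling(n + 1)
              .apply(lambda w: (w[:-1] > w[-1]).sum(), raw=True))
    out = df.copy()
    # Assignment aligns on the index, so the reversed order does not matter.
    out[dst] = counts.fillna(-1).astype(int)
    return out


df = pd.DataFrame({"col1": [4, 2, 8, 4, 3, 7, 6, 9, 3, 5]})
result = count_next_greater_rolling(df, n=3)
print(result)
```

The function returns a copy rather than modifying the input; drop the `df.copy()` if in-place modification is preferred.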

Alternatively, using numpy.lib.stride_tricks.sliding_window_view:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view as swv

N = 3
df['col2'] = np.r_[(df['col1'].to_numpy()[:-N, None]
                    < swv(df['col1'], N)[1:]  # broadcasted comparison
                    ).sum(axis=1),            # count True per row
                   -np.ones(N, dtype=int)]    # add missing -1

Output:

   col1  col2
0     4     1
1     2     3
2     8     0
3     4     2
4     3     3
5     7     1
6     6     1
7     9    -1
8     3    -1
9     5    -1
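To get the function with a configurable number of rows that the question asks for, the numpy approach can be wrapped as sketched below. The name `add_next_greater_count` and its parameters `n`, `src` and `dst` are my own choices, not part of any library. The guard for short inputs is an assumption about the desired behaviour (every row gets -1 when there are not more than `n` rows).

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def add_next_greater_count(df, n=3, src="col1", dst="col2"):
    """Hypothetical helper: add `dst`, counting for each row how many of
    the next n values in `src` are greater than the row's own value."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    values = df[src].to_numpy()
    out = df.copy()
    if len(values) <= n:
        # No row has n rows after it (assumed behaviour: all -1).
        out[dst] = -1
        return out
    # Row i of `windows` holds values[i + 1 : i + 1 + n], the next n values.
    windows = sliding_window_view(values, n)[1:]
    # Broadcast each value against its window and count the larger ones.
    counts = (values[:-n, None] < windows).sum(axis=1)
    # The last n rows have too few followers, so they get -1.
    out[dst] = np.concatenate([counts, np.full(n, -1)])
    return out


df = pd.DataFrame({"col1": [4, 2, 8, 4, 3, 7, 6, 9, 3, 5]})
result = add_next_greater_count(df, n=3)
print(result)
```

Note that the broadcasted comparison builds a temporary boolean array of shape (len(df) - n, n), so memory use grows with `n`; for small windows this should not matter.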