How to find the last True position of the leading group of True values faster?

Time:12-10

I have a DataFrame; generate_data() builds a small demo of it. I want to compute the following:

  1. If the first value in the data column is False, return 0.
  2. If the first value in the data column is True, return the order value at the last position of the leading run of consecutive True values.
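
The two rules above can be sketched in plain Python, without pandas, as a reference to check faster versions against. This is only an illustrative sketch: the name last_true_order and the list-based signature are assumptions, not part of the original code.

```python
from itertools import takewhile


def last_true_order(data, order):
    """Return the order value ending the leading run of True values, or 0."""
    # Count the True values that appear before the first False.
    run_length = sum(1 for _ in takewhile(bool, data))
    # An empty run means the first value was False (or there was no data).
    return order[run_length - 1] if run_length else 0


print(last_true_order([True, True, False, False, True, False], [1, 2, 3, 4, 5, 6]))  # 2
```

Both rules collapse into one expression because the answer is determined entirely by the length of the leading run of True values.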

I wrote two methods: sort_order() and sort_order2()

%timeit sort_order(df.copy())
1.12 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit sort_order2(df.copy())
715 µs ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Is there a faster way?

My code is as follows:

import pandas as pd
import numpy as np


def generate_data():
    order = range(1,7)
    data = [True, True, False, False, True, False]
    c = {'order': order,
         'data': data}
    df = pd.DataFrame(c)
    return df


def sort_order(df):
    order_first_false = df.loc[~df.data, 'order']
    if len(order_first_false) == 0:
        order_last_true = df.order.values[-1]
    else:
        order_first_false = order_first_false.values[0]
        df = df[df.order < order_first_false]
        if len(df):
            order_last_true = df.order.values[-1]
        else:
            order_last_true = 0
    return order_last_true


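def sort_order2(df):
    # Running count of False values; it stays 0 only across the leading run of True.
    groups = df['data'].ne(True).cumsum()
    len_true = len(groups[groups == 0])
    if len_true:
        # df.at returns a scalar, so no further reduction is needed.
        order_last_true = df.at[df.index[len_true - 1], 'order']
    else:
        order_last_true = 0
    return order_last_true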


def main():
    df = generate_data()
    print(df)

    order_last_true = sort_order(df.copy())
    print(order_last_true)

    order_last_true = sort_order2(df.copy())
    print(order_last_true)


if __name__ == '__main__':
    main()

The result I expected is:

   order   data
0      1   True
1      2   True
2      3  False
3      4  False
4      5   True
5      6  False

2

2

CodePudding user response:

Use numba to scan the values up to the end of the first block of True values, inspired by this solution:

from numba import njit

@njit
def sort_order3(a, b):
    if not a[0]:
        return 0
    else:
        for i in range(1, len(a)):
            if not a[i]:
                return b[i - 1]
        return b[-1]


  
df = generate_data()
print(sort_order3(df['data'].to_numpy(), df['order'].to_numpy()))
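
When timing a numba function, note that the first call includes JIT compilation, so it is worth calling the function once before measuring. The sketch below illustrates this with the same loop; the name first_block_last_order and the plain-Python fallback used when numba is not installed are assumptions added for illustration.

```python
import time

import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to an uncompiled function if numba is missing.
    def njit(func):
        return func


@njit
def first_block_last_order(a, b):
    # Same scan as above: stop at the first False and return the previous order.
    if not a[0]:
        return 0
    for i in range(1, len(a)):
        if not a[i]:
            return b[i - 1]
    return b[-1]


data = np.array([True, True, False, False, True, False])
order = np.arange(1, 7)

start = time.perf_counter()
first = first_block_last_order(data, order)  # with numba, this call also compiles
first_call = time.perf_counter() - start

start = time.perf_counter()
second = first_block_last_order(data, order)  # reuses the compiled version
second_call = time.perf_counter() - start

print(first, second)
print(f"first call: {first_call:.6f} s, second call: {second_call:.6f} s")
```

With numba installed, the first call is typically much slower than the second, so only the later calls reflect the steady-state cost.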

CodePudding user response:

Maybe I am missing something, but why don't you just find the index of the first False in df.data and then take the value in the df.order column from the row just before it?

For example:

def sort_order3(df):
    try:
        idx = df.data.to_list().index(False)
    except ValueError:  # there is no False in df.data
        idx = df.data.size
    # idx is the length of the leading run of True values
    return df.order.iat[idx - 1] if idx else 0

Or, for really large data, NumPy might be faster:

def sort_order4(df):
    try:
        idx = np.argwhere(~df.data.values)[0, 0]
    except IndexError:  # there is no False in df.data
        idx = df.data.size
    # idx is the length of the leading run of True values
    return df.order.iat[idx - 1] if idx else 0

The timing on my device:

%timeit sort_order(df.copy())
565 µs ± 6.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit sort_order2(df.copy())
443 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit sort_order3(df.copy())
96.5 µs ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit sort_order4(df.copy())
112 µs ± 5.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
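
The first-False lookup can also be written without try/except by calling argmin on the boolean array, since argmin returns the position of the first minimum (the first False). This is a minimal sketch and not one of the functions timed above; the name last_true_order_np is hypothetical, and it takes arrays rather than a DataFrame.

```python
import numpy as np


def last_true_order_np(data, order):
    data = np.asarray(data, dtype=bool)
    order = np.asarray(order)
    # Rule 1: a False (or missing) first value means the answer is 0.
    if data.size == 0 or not data[0]:
        return 0
    # argmin gives the first False; it gives 0 when every value is True.
    first_false = int(data.argmin())
    if data[first_false]:
        # No False at all, so the leading run covers the whole column.
        return order[-1]
    return order[first_false - 1]


print(last_true_order_np([True, True, False, False, True, False], range(1, 7)))  # 2
```

It could be called on a DataFrame as last_true_order_np(df['data'].to_numpy(), df['order'].to_numpy()).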