I have a dataframe; a demo is generated by generate_data().
- If the first value in the data column is False, return 0.
- If the first value in the data column is True, return the order value at the last position of the leading run of consecutive True values.
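To make the two rules concrete, here is a minimal plain-Python sketch that can serve as a reference when checking faster versions. The name `last_leading_true` and the list-based inputs are illustrative assumptions, not part of the code in this question.

```python
def last_leading_true(order, data):
    """Return the `order` value at the end of the leading run of True values.

    Hypothetical reference helper: `order` and `data` are plain sequences of
    equal length. Returns 0 when the first flag is False or the input is empty.
    """
    last = 0
    for value, flag in zip(order, data):
        if not flag:
            break  # the leading run of True values has ended
        last = value
    return last


print(last_leading_true(range(1, 7), [True, True, False, False, True, False]))  # 2
print(last_leading_true(range(1, 7), [False, True, True, True, True, True]))    # 0
print(last_leading_true(range(1, 7), [True] * 6))                               # 6
```

The third call covers the case with no False at all, where the last order value is returned.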
I wrote two methods, sort_order() and sort_order2(), and timed them:
%timeit sort_order(df.copy())
1.12 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit sort_order2(df.copy())
715 µs ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Is there a faster way?
My code is as follows:
import pandas as pd
import numpy as np

def generate_data():
    order = range(1,7)
    data = [True, True, False, False, True, False]
    c = {'order': order,
         'data': data}
    df = pd.DataFrame(c)
    return df

def sort_order(df):
    order_first_false = df.loc[~df.data, 'order']
    if len(order_first_false) == 0:
        order_last_true = df.order.values[-1]
    else:
        order_first_false = order_first_false.values[0]
        df = df[df.order < order_first_false]
        if len(df):
            order_last_true = df.order.values[-1]
        else:
            order_last_true = 0
    return order_last_true

def sort_order2(df):
    groups = df[f'data'].ne(True).cumsum()
    len_true = len(groups[groups == 0])
    if len_true:
        order_last_true = df.at[df.index[len_true - 1], 'order'].max()
    else:
        order_last_true = 0
    return order_last_true

def main():
    df = generate_data()
    print(df)
    order_last_true = sort_order(df.copy())
    print(order_last_true)
    order_last_true = sort_order2(df.copy())
    print(order_last_true)

if __name__ == '__main__':
    main()
The result I expected is:
   order   data
0      1   True
1      2   True
2      3  False
3      4  False
4      5   True
5      6  False
2
2
CodePudding user response:
Use numba to loop over the values up to the end of the first block of Trues, inspired by this solution:
from numba import njit

@njit
def sort_order3(a, b):
    if not a[0]:
        return 0
    else:
        for i in range(1, len(a)):
            if not a[i]:
                return b[i - 1]
        return b[-1]

df = generate_data()
print(sort_order3(df['data'].to_numpy(), df['order'].to_numpy()))
CodePudding user response:
Maybe I am missing something, but why don't you just find the position of the first False in df.data and then take the value of df.order just before it? If the first False is at position 0, the result is 0.
For example:
def sort_order3(df):
    try:
        idx = df.data.to_list().index(False)
    except ValueError:  # meaning there is no False in df.data
        idx = df.data.size
    # idx is the length of the leading run of True values
    return df.order.iloc[idx - 1] if idx else 0
Or for really large data numpy might be faster:
def sort_order4(df):
    try:
        idx = np.argwhere(~df.data.values)[0, 0]
    except IndexError:  # meaning there is no False in df.data
        idx = df.data.size
    # idx is the length of the leading run of True values
    return df.order.iloc[idx - 1] if idx else 0
The timing on my device:
%timeit sort_order(df.copy())
565 µs ± 6.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit sort_order2(df.copy())
443 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit sort_order3(df.copy())
96.5 µs ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit sort_order4(df.copy())
112 µs ± 5.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
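`%timeit` is an IPython magic, so outside IPython or Jupyter a similar comparison can be approximated with the standard-library `timeit` module. The sketch below is illustrative only: `bench` and `first_false_position` are hypothetical names, the workload is a plain list rather than the dataframe, and it reports the best run rather than mean ± std. dev., so its numbers are not comparable to the timings above.

```python
import timeit


def bench(func, *args, repeat=7, number=1000):
    """Return the best per-call time in microseconds over `repeat` runs."""
    timer = timeit.Timer(lambda: func(*args))
    totals = timer.repeat(repeat=repeat, number=number)  # seconds per run
    return min(totals) / number * 1e6


def first_false_position(flags):
    """Hypothetical workload: length of the leading run of True values."""
    try:
        return flags.index(False)
    except ValueError:  # there is no False in the list
        return len(flags)


flags = [True, True, False, False, True, False]
print(f"{bench(first_false_position, flags):.3f} µs per call (best of 7)")
```

Passing a dataframe-based function and `df.copy()` as the argument would time the function without the cost of the copy, since the copy is made once before the loop.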