What is the Fastest way to compare values across columns in pandas (Python)


I have the following dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 1, 1, 1], [1, 1, np.nan, 1], [1, np.nan, 1, 1]]),
                    columns=['t', 't_1', 't_2', 't_3'])

In reality it has ~10 million rows. I need a fast way to know, for each row, which is the last consecutive column that has a non-null value. Taking this df as an example, the result would be:

df_result = pd.DataFrame(np.array([[1, 1, 1, 1], [1, 1, np.nan, np.nan], [1, np.nan, np.nan, np.nan]]),
                    columns=['t', 't_1', 't_2', 't_3'])

Currently I'm doing this with the following function applied row by row via a lambda, but it is too slow:

def second_to_last_null(*args):
    # Return NaN as soon as any value in the sequence is NaN;
    # otherwise return the last value.
    for value in args:
        if np.isnan(value):
            return np.nan
    return args[-1]


df_result['t'] = df['t']
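# Row-wise apply calls the Python function once per row for every column, which is slow on ~10M rows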
df_result['t_1_consecutive'] = df[['t', 't_1']].apply(lambda x: second_to_last_null(x.t, x.t_1), axis=1)
df_result['t_2_consecutive'] = df[['t', 't_1', 't_2']].apply(lambda x: second_to_last_null(x.t, x.t_1, x.t_2), axis=1)
df_result['t_3_consecutive'] = df[['t', 't_1', 't_2', 't_3']].apply(lambda x: second_to_last_null(x.t, x.t_1, x.t_2, x.t_3), axis=1)

Can somebody suggest the fastest way to do this in pandas/NumPy? A simple technical explanation of why that method is better than mine would also be much appreciated :)

CodePudding user response:

Try cumsum on isna, then mask:

df_result = df.mask(df.isna().cumsum(axis=1) >= 1)

Output:

     t  t_1  t_2  t_3
0  1.0  1.0  1.0  1.0
1  1.0  1.0  NaN  NaN
2  1.0  NaN  NaN  NaN

Explanation: df.isna() marks each NaN with True and everything else with False. Taking the cumsum(axis=1) then gives the running count of NaNs so far along each row. Finally, a cumulative sum >= 1 indicates that there is a NaN at or before that position, so mask replaces those values with NaN. Every step is a vectorized operation over the whole frame, which avoids calling a Python function once per row the way apply(..., axis=1) does, and that is why it is much faster.
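For illustration, here is a minimal sketch of the same one-liner broken into its intermediate steps on the example frame (the helper variable names are just for exposition):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 1, 1, 1], [1, 1, np.nan, 1], [1, np.nan, 1, 1]]),
                  columns=['t', 't_1', 't_2', 't_3'])

is_missing = df.isna()                   # True where a value is NaN
nans_so_far = is_missing.cumsum(axis=1)  # running count of NaNs along each row
past_first_nan = nans_so_far >= 1        # True at and after the first NaN in a row
df_result = df.mask(past_first_nan)      # replace those positions with NaN
print(df_result)

On ~10 million rows this vectorized version should be far faster than the row-wise apply, since the work happens in compiled NumPy/pandas code rather than in a per-row Python function call (exact timings depend on hardware and dtypes).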
