Home > Enterprise >  Faster way of shifting data across columns in dataframe pandas
Faster way of shifting data across columns in dataframe pandas

Time:06-10

I have a dataframe divided in variables (a,b) and on time values (1,5). The columns of the dataframe are a combination of the variables and time values ("a_1"). However, I need to transform these time values that are absolute into relative values. For that, I have another dataframe with reference indicator stating how many time values to move.

Therefore, I want to shift the positions of the values I have according to a reference indicator, which is represented by another dataframe, and that changes according to the index.

EX: If for a specific index, the reference indicator is 3, I'd want the data in that index to move to the left until the position 3 goes to a_1 (so is moves 2 (3-1) places), such as: Original:

        a_1       a_2       a_3      a_4       a_5
0  0.854592  0.677819  0.071725  0.29312  0.948375

Shifted:

 a_1      a_2       a_3  a_4  a_5
0  0.071725  0.29312  0.948375  NaN  NaN

I have created the code below, which successfully achieves the desired outcome, however if takes a lot of time to compute (I'm testing actually with 100k index).

I would appreciate any help in optimizing the code.

Reproducible code:

import numpy as np
import pandas as pd

# main data to be shifted
var_names = ['a','b']
df_example = pd.DataFrame(np.random.rand(1000,10),index=range(0,1000))
df_example.columns = [var_name  "_" str(j) for var_name in var_names for j in range(1, 6)]

# reference index to determine how many places to be shifted
df_ref = pd.DataFrame(np.random.randint(1,5, size = (1000,1)),index=range(0,1000), columns = ['moving_indicator'])

list_vars_shifted = []
for var in var_names:
    df_vars_shifted = pd.concat([df_ref.loc[:,'moving_indicator'],
                                              df_example.filter(like=var)], axis = 1)
    
    # Shift accoording to month indicator (hence  1) - SLOW
    df_vars_shifted = (df_vars_shifted.apply(lambda x : x.shift(-(int(x['moving_indicator'])) 1) , axis=1)
                                .drop(columns=['moving_indicator']))
    
    list_vars_shifted.append(df_vars_shifted)

# Convert to dataframe
df_all_vars_shifted = pd.concat(list_vars_shifted, axis=1)

CodePudding user response:

How about this? I didn't run the timing tests because I ran out of time. I put some print outs of the looped dataframes to show what is happening. I changed the moving indicator to 0 for not moving, so then periods= can be a 0 so it doesn't shift. The .replace could be dangerous depending on the data, so it is a little rough.

    import numpy as np
    import pandas as pd

    # main data to be shifted
    df_a = pd.DataFrame(np.random.rand(1000,5),index=range(0,1000))
    df_a.columns = [
        var_name  "_" str(j) for var_name in ['a'] for j in range(1, 6)
    ]

    df_b = pd.DataFrame(np.random.rand(1000,5),index=range(0,1000))
    df_b.columns = [
        var_name  "_" str(j) for var_name in ['b'] for j in range(1, 6)
    ]

    # reference index to determine how many places to be shifted
    df_ref = pd.DataFrame(
        np.random.randint(0, 4, size=(1000,1)),
        index=range(0,1000),
        columns=['moving_indicator']
    )

    df_a = df_a.merge(df_ref, how='inner', left_index=True, right_index=True)
    df_grp = df_a.groupby('moving_indicator')
    new_df_a = pd.DataFrame([])
    for indicator, gdf in df_grp:
        indicator
        indicator = indicator * -1
        gdf.shift(periods=indicator, axis=1)
        gdf = gdf.shift(periods=indicator, axis=1)
        new_df_a = pd.concat([new_df_a, gdf])
    
    new_df_a = new_df_a.sort_index()
    new_df_a = (
        new_df_a.replace({3: np.nan, 2: np.nan, 1: np.nan})
        .drop('moving_indicator', axis=1)
    )

    df_b = df_b.merge(df_ref, how='inner', left_index=True, right_index=True)
    df_grp = df_b.groupby('moving_indicator')
    new_df_b = pd.DataFrame([])
    for indicator, gdf in df_grp:
        indicator
        indicator = indicator * -1
        gdf.shift(periods=indicator, axis=1)
        gdf = gdf.shift(periods=indicator, axis=1)
        new_df_b = pd.concat([new_df_b, gdf])
    
    new_df_b = new_df_b.sort_index()
    new_df_b = (
        new_df_b.replace({3: np.nan, 2: np.nan, 1: np.nan})
        .drop('moving_indicator', axis=1)
    )

    final_df = new_df_a.merge(
        new_df_b, how='inner', left_index=True, right_index=True
    )

Edit: Here are the timings. Question version:

>>> print(timeit.repeat(dummy, repeat=5, number=1))
[0.1520585000034771, 0.1450397999942652, 0.1416596999988542,
0.14743759999691974, 0.14560850000270875]

My version:

>>> print(timeit.repeat(my_func, repeat=5, number=1))
[0.022981900001468603, 0.0159782000046107, 0.01633900000160793,
0.015842399996472523, 0.01663669999834383]

CodePudding user response:

I tried different ways, and the best one was is to use list comprehension shift dataframe.where():

var_names = ['a','b']
df_example = pd.DataFrame(np.random.rand(10000,20),index=range(0,10000))
df_example.columns = [var_name  "_" str(j) for var_name in var_names for j in range(1, 11)]

# reference index to determine how many places to be shifted
df_ref = pd.DataFrame(np.random.randint(1,5, size = (10000,1)),index=range(0,10000), columns = ['moving_indicator'])

list_vars_shifted = []
for var in var_names:
    df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)

    list_shifted_variables = [df_vars.shift(-(indicator) 1, axis = 1).where(indicator == df_vars['moving_indicator']).dropna( how = 'all') for indicator in np.unique(df_vars['moving_indicator'])]
    df_vars_shifted = pd.concat(list_shifted_variables).sort_index().drop(columns=['moving_indicator'])
    list_vars_shifted.append(df_vars_shifted)

df_all_vars_shifted_6 = pd.concat(list_vars_shifted, axis=1)

Full code with all the different approaches:

import numpy as np
import pandas as pd
import swifter

import time

t1 = time.process_time()

# main data to be shifted
var_names = ['a','b']
df_example = pd.DataFrame(np.random.rand(10000,20),index=range(0,10000))
df_example.columns = [var_name  "_" str(j) for var_name in var_names for j in range(1, 11)]

# reference index to determine how many places to be shifted
df_ref = pd.DataFrame(np.random.randint(1,5, size = (10000,1)),index=range(0,10000), columns = ['moving_indicator'])

list_vars_shifted = []
for var in var_names:
    df_vars = pd.concat([df_ref.loc[:,'moving_indicator'],
                                              df_example.filter(like=var)], axis = 1)
    
    # Shift accoording to month indicator (hence  1) - SLOW
    df_vars_shifted = (df_vars.apply(lambda x : x.shift(-(int(x['moving_indicator'])) 1) , axis=1)
                                .drop(columns=['moving_indicator']))
    
    list_vars_shifted.append(df_vars_shifted)

# Convert to dataframe
df_all_vars_shifted = pd.concat(list_vars_shifted, axis=1)


elapsed_time1 = time.process_time() - t1
print(elapsed_time1)




t2 = time.process_time()

list_vars_shifted = []
for var in var_names:
    df_vars = pd.concat([df_ref.loc[:,'moving_indicator'],
                                              df_example.filter(like=var)], axis = 1)
    
    # Shift accoording to month indicator (hence  1) - SLOW
    df_vars_shifted = (df_vars.swifter.apply(lambda x : x.shift(-(int(x['moving_indicator'])) 1) , axis=1)
                                .drop(columns=['moving_indicator']))
    
    list_vars_shifted.append(df_vars_shifted)

# Convert to dataframe
df_all_vars_shifted_2 = pd.concat(list_vars_shifted, axis=1)


elapsed_time2 = time.process_time() - t2
print(elapsed_time2)



t3 = time.process_time()

list_vars_shifted = []
for var in var_names:
    df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)
    
    # Shift accoording to month indicator (hence  1) - SLOW
    df_vars_shifted = pd.DataFrame([df_vars.iloc[i].shift(-(int(df_vars.iloc[i,0])) 1) for i in range(len(df_vars))]).drop(columns=['moving_indicator'])
    
    list_vars_shifted.append(df_vars_shifted)

# Convert to dataframe
df_all_vars_shifted_3 = pd.concat(list_vars_shifted, axis=1)


elapsed_time3 = time.process_time() - t3
print(elapsed_time3)



t4 = time.process_time()

list_vars_shifted = []
for var in var_names:
    df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)
    
    # Shift accoording to month indicator (hence  1) - SLOW
    df_vars_shifted = pd.DataFrame(row[1].shift(-(int(row[1]['moving_indicator'])) 1) for row in df_vars.iterrows()).drop(columns=['moving_indicator'])
    
    list_vars_shifted.append(df_vars_shifted)

# Convert to dataframe
df_all_vars_shifted_4 = pd.concat(list_vars_shifted, axis=1)


elapsed_time4 = time.process_time() - t4
print(elapsed_time4)


t5 = time.process_time()

list_vars_shifted = []
for var in var_names:
    df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)

    list_test = []
    for indicator in np.unique(df_vars['moving_indicator']):
        df_test10 = df_vars.shift(-(indicator) 1, axis = 1).where(indicator == df_vars['moving_indicator']).dropna( how = 'all')
        list_test.append(df_test10)
        
        df_vars_shifted = pd.concat(list_test).sort_index().drop(columns=['moving_indicator'])
        list_vars_shifted.append(df_vars_shifted)

df_all_vars_shifted_5 = pd.concat(list_vars_shifted, axis=1)

elapsed_time5 = time.process_time() - t5
print(elapsed_time5)



t6 = time.process_time()

list_vars_shifted = []
for var in var_names:
    df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)

    list_shifted_variables = [df_vars.shift(-(indicator) 1, axis = 1).where(indicator == df_vars['moving_indicator']).dropna( how = 'all') for indicator in np.unique(df_vars['moving_indicator'])]
    df_vars_shifted = pd.concat(list_shifted_variables).sort_index().drop(columns=['moving_indicator'])
    list_vars_shifted.append(df_vars_shifted)

df_all_vars_shifted_6 = pd.concat(list_vars_shifted, axis=1)

elapsed_time6 = time.process_time() - t6
print(elapsed_time6)
  • Related