I have a dataframe divided in variables (a,b) and on time values (1,5). The columns of the dataframe are a combination of the variables and time values ("a_1"). However, I need to transform these time values that are absolute into relative values. For that, I have another dataframe with reference indicator stating how many time values to move.
Therefore, I want to shift the positions of the values I have according to a reference indicator, which is represented by another dataframe, and that changes according to the index.
EX: If for a specific index, the reference indicator is 3, I'd want the data in that index to move to the left until the position 3 goes to a_1 (so is moves 2 (3-1) places), such as: Original:
a_1 a_2 a_3 a_4 a_5
0 0.854592 0.677819 0.071725 0.29312 0.948375
Shifted:
a_1 a_2 a_3 a_4 a_5
0 0.071725 0.29312 0.948375 NaN NaN
I have created the code below, which successfully achieves the desired outcome, however if takes a lot of time to compute (I'm testing actually with 100k index).
I would appreciate any help in optimizing the code.
Reproducible code:
import numpy as np
import pandas as pd
# main data to be shifted
var_names = ['a','b']
df_example = pd.DataFrame(np.random.rand(1000,10),index=range(0,1000))
df_example.columns = [var_name "_" str(j) for var_name in var_names for j in range(1, 6)]
# reference index to determine how many places to be shifted
df_ref = pd.DataFrame(np.random.randint(1,5, size = (1000,1)),index=range(0,1000), columns = ['moving_indicator'])
list_vars_shifted = []
for var in var_names:
df_vars_shifted = pd.concat([df_ref.loc[:,'moving_indicator'],
df_example.filter(like=var)], axis = 1)
# Shift accoording to month indicator (hence 1) - SLOW
df_vars_shifted = (df_vars_shifted.apply(lambda x : x.shift(-(int(x['moving_indicator'])) 1) , axis=1)
.drop(columns=['moving_indicator']))
list_vars_shifted.append(df_vars_shifted)
# Convert to dataframe
df_all_vars_shifted = pd.concat(list_vars_shifted, axis=1)
CodePudding user response:
How about this? I didn't run the timing tests because I ran out of time. I put some print outs of the looped dataframes to show what is happening. I changed the moving indicator to 0 for not moving, so then periods= can be a 0 so it doesn't shift. The .replace could be dangerous depending on the data, so it is a little rough.
import numpy as np
import pandas as pd
# main data to be shifted
df_a = pd.DataFrame(np.random.rand(1000,5),index=range(0,1000))
df_a.columns = [
var_name "_" str(j) for var_name in ['a'] for j in range(1, 6)
]
df_b = pd.DataFrame(np.random.rand(1000,5),index=range(0,1000))
df_b.columns = [
var_name "_" str(j) for var_name in ['b'] for j in range(1, 6)
]
# reference index to determine how many places to be shifted
df_ref = pd.DataFrame(
np.random.randint(0, 4, size=(1000,1)),
index=range(0,1000),
columns=['moving_indicator']
)
df_a = df_a.merge(df_ref, how='inner', left_index=True, right_index=True)
df_grp = df_a.groupby('moving_indicator')
new_df_a = pd.DataFrame([])
for indicator, gdf in df_grp:
indicator
indicator = indicator * -1
gdf.shift(periods=indicator, axis=1)
gdf = gdf.shift(periods=indicator, axis=1)
new_df_a = pd.concat([new_df_a, gdf])
new_df_a = new_df_a.sort_index()
new_df_a = (
new_df_a.replace({3: np.nan, 2: np.nan, 1: np.nan})
.drop('moving_indicator', axis=1)
)
df_b = df_b.merge(df_ref, how='inner', left_index=True, right_index=True)
df_grp = df_b.groupby('moving_indicator')
new_df_b = pd.DataFrame([])
for indicator, gdf in df_grp:
indicator
indicator = indicator * -1
gdf.shift(periods=indicator, axis=1)
gdf = gdf.shift(periods=indicator, axis=1)
new_df_b = pd.concat([new_df_b, gdf])
new_df_b = new_df_b.sort_index()
new_df_b = (
new_df_b.replace({3: np.nan, 2: np.nan, 1: np.nan})
.drop('moving_indicator', axis=1)
)
final_df = new_df_a.merge(
new_df_b, how='inner', left_index=True, right_index=True
)
Edit: Here are the timings. Question version:
>>> print(timeit.repeat(dummy, repeat=5, number=1))
[0.1520585000034771, 0.1450397999942652, 0.1416596999988542,
0.14743759999691974, 0.14560850000270875]
My version:
>>> print(timeit.repeat(my_func, repeat=5, number=1))
[0.022981900001468603, 0.0159782000046107, 0.01633900000160793,
0.015842399996472523, 0.01663669999834383]
CodePudding user response:
I tried different ways, and the best one was is to use list comprehension shift dataframe.where():
var_names = ['a','b']
df_example = pd.DataFrame(np.random.rand(10000,20),index=range(0,10000))
df_example.columns = [var_name "_" str(j) for var_name in var_names for j in range(1, 11)]
# reference index to determine how many places to be shifted
df_ref = pd.DataFrame(np.random.randint(1,5, size = (10000,1)),index=range(0,10000), columns = ['moving_indicator'])
list_vars_shifted = []
for var in var_names:
df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)
list_shifted_variables = [df_vars.shift(-(indicator) 1, axis = 1).where(indicator == df_vars['moving_indicator']).dropna( how = 'all') for indicator in np.unique(df_vars['moving_indicator'])]
df_vars_shifted = pd.concat(list_shifted_variables).sort_index().drop(columns=['moving_indicator'])
list_vars_shifted.append(df_vars_shifted)
df_all_vars_shifted_6 = pd.concat(list_vars_shifted, axis=1)
Full code with all the different approaches:
import numpy as np
import pandas as pd
import swifter
import time
t1 = time.process_time()
# main data to be shifted
var_names = ['a','b']
df_example = pd.DataFrame(np.random.rand(10000,20),index=range(0,10000))
df_example.columns = [var_name "_" str(j) for var_name in var_names for j in range(1, 11)]
# reference index to determine how many places to be shifted
df_ref = pd.DataFrame(np.random.randint(1,5, size = (10000,1)),index=range(0,10000), columns = ['moving_indicator'])
list_vars_shifted = []
for var in var_names:
df_vars = pd.concat([df_ref.loc[:,'moving_indicator'],
df_example.filter(like=var)], axis = 1)
# Shift accoording to month indicator (hence 1) - SLOW
df_vars_shifted = (df_vars.apply(lambda x : x.shift(-(int(x['moving_indicator'])) 1) , axis=1)
.drop(columns=['moving_indicator']))
list_vars_shifted.append(df_vars_shifted)
# Convert to dataframe
df_all_vars_shifted = pd.concat(list_vars_shifted, axis=1)
elapsed_time1 = time.process_time() - t1
print(elapsed_time1)
t2 = time.process_time()
list_vars_shifted = []
for var in var_names:
df_vars = pd.concat([df_ref.loc[:,'moving_indicator'],
df_example.filter(like=var)], axis = 1)
# Shift accoording to month indicator (hence 1) - SLOW
df_vars_shifted = (df_vars.swifter.apply(lambda x : x.shift(-(int(x['moving_indicator'])) 1) , axis=1)
.drop(columns=['moving_indicator']))
list_vars_shifted.append(df_vars_shifted)
# Convert to dataframe
df_all_vars_shifted_2 = pd.concat(list_vars_shifted, axis=1)
elapsed_time2 = time.process_time() - t2
print(elapsed_time2)
t3 = time.process_time()
list_vars_shifted = []
for var in var_names:
df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)
# Shift accoording to month indicator (hence 1) - SLOW
df_vars_shifted = pd.DataFrame([df_vars.iloc[i].shift(-(int(df_vars.iloc[i,0])) 1) for i in range(len(df_vars))]).drop(columns=['moving_indicator'])
list_vars_shifted.append(df_vars_shifted)
# Convert to dataframe
df_all_vars_shifted_3 = pd.concat(list_vars_shifted, axis=1)
elapsed_time3 = time.process_time() - t3
print(elapsed_time3)
t4 = time.process_time()
list_vars_shifted = []
for var in var_names:
df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)
# Shift accoording to month indicator (hence 1) - SLOW
df_vars_shifted = pd.DataFrame(row[1].shift(-(int(row[1]['moving_indicator'])) 1) for row in df_vars.iterrows()).drop(columns=['moving_indicator'])
list_vars_shifted.append(df_vars_shifted)
# Convert to dataframe
df_all_vars_shifted_4 = pd.concat(list_vars_shifted, axis=1)
elapsed_time4 = time.process_time() - t4
print(elapsed_time4)
t5 = time.process_time()
list_vars_shifted = []
for var in var_names:
df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)
list_test = []
for indicator in np.unique(df_vars['moving_indicator']):
df_test10 = df_vars.shift(-(indicator) 1, axis = 1).where(indicator == df_vars['moving_indicator']).dropna( how = 'all')
list_test.append(df_test10)
df_vars_shifted = pd.concat(list_test).sort_index().drop(columns=['moving_indicator'])
list_vars_shifted.append(df_vars_shifted)
df_all_vars_shifted_5 = pd.concat(list_vars_shifted, axis=1)
elapsed_time5 = time.process_time() - t5
print(elapsed_time5)
t6 = time.process_time()
list_vars_shifted = []
for var in var_names:
df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)
list_shifted_variables = [df_vars.shift(-(indicator) 1, axis = 1).where(indicator == df_vars['moving_indicator']).dropna( how = 'all') for indicator in np.unique(df_vars['moving_indicator'])]
df_vars_shifted = pd.concat(list_shifted_variables).sort_index().drop(columns=['moving_indicator'])
list_vars_shifted.append(df_vars_shifted)
df_all_vars_shifted_6 = pd.concat(list_vars_shifted, axis=1)
elapsed_time6 = time.process_time() - t6
print(elapsed_time6)