We have data representing temperature forecast for every 3 hours period from the moment. We also know number of 3 hour periods after which the weather is needed. So, we have dataframe:
import pandas as pd
d = {'T_forecast': [[11.98, 10.84, 8.74, 6.31, 4.52],[11.29, 7.87, 3.94, 5.02, 7.97],[16.22, 14.87, 11.31, 10.54, 10.72]
,[9.77, 7.54, 5.96, 2.75, 4.99],[18.61, 16.52, 13.52, 11.62, 16.44]], 'delta_hours_divided_by_3': [3, 1,2,3,1]}
df = pd.DataFrame(data=d)
print(df)
T_forecast delta_hours_divided_by_3
0 [11.98, 10.84, 8.74, 6.31, 4.52] 3
1 [11.29, 7.87, 3.94, 5.02, 7.97] 1
2 [16.22, 14.87, 11.31, 10.54, 10.72] 2
3 [9.77, 7.54, 5.96, 2.75, 4.99] 3
4 [18.61, 16.52, 13.52, 11.62, 16.44] 1
We need to create column with particular element from the list in 'T_forecast' columns.The result should be:
T_forecast delta_hours_divided_by_3 T_by_the_shift_start
0 [11.98, 10.84, 8.74, 6.31, 4.52] 3 8.74
1 [11.29, 7.87, 3.94, 5.02, 7.97] 1 11.29
2 [16.22, 14.87, 11.31, 10.54, 10.72] 2 14.87
3 [9.77, 7.54, 5.96, 2.75, 4.99] 3 5.96
4 [18.61, 16.52, 13.52, 11.62, 16.44] 1 18.61
I can get it for particular value, with code:
print(df['T_forecast'][0][df['delta_hours_divided_by_3'][0]-1])
8.74
But I struggle creating column out:
df['T_by_the_shift_start']=df['T_forecast'][df['delta_hours_divided_by_3']-1]
ValueError: cannot reindex on an axis with duplicate labels
Using the for loop is not an option since the original dataframe is very large and the server will choke. What can lead to solving the issue?
CodePudding user response:
Loop solution
df['T_by_the_shift_start'] = [a[b - 1] for a, b in df.to_numpy()]
Non-loop solution
** lists should have same length across all rows
** This solution will perform around 2x better only on large data sets >= 500K
df['T_by_the_shift_start'] = np.array([*df['T_forecast']])[range(len(df)), df['delta_hours_divided_by_3'] - 1]
T_forecast delta_hours_divided_by_3 T_by_the_shift_start
0 [11.98, 10.84, 8.74, 6.31, 4.52] 3 8.74
1 [11.29, 7.87, 3.94, 5.02, 7.97] 1 11.29
2 [16.22, 14.87, 11.31, 10.54, 10.72] 2 14.87
3 [9.77, 7.54, 5.96, 2.75, 4.99] 3 5.96
4 [18.61, 16.52, 13.52, 11.62, 16.44] 1 18.61
CodePudding user response:
here is one way to do it
df['T_by_the_shift_start']=df.apply(lambda x: x['T_forecast'][x['delta_hours_divided_by_3']-1] , axis=1)
df
T_forecast delta_hours_divided_by_3 T_by_the_shift_start
0 [11.98, 10.84, 8.74, 6.31, 4.52] 3 8.74
1 [11.29, 7.87, 3.94, 5.02, 7.97] 1 11.29
2 [16.22, 14.87, 11.31, 10.54, 10.72] 2 14.87
3 [9.77, 7.54, 5.96, 2.75, 4.99] 3 5.96
4 [18.61, 16.52, 13.52, 11.62, 16.44] 1 18.61