Using one column values as index to list type values in another in pandas-CodePudding

We have data representing temperature forecast for every 3 hours period from the moment. We also know number of 3 hour periods after which the weather is needed. So, we have dataframe:

import pandas as pd
d = {'T_forecast': [[11.98, 10.84, 8.74, 6.31, 4.52],[11.29, 7.87, 3.94, 5.02, 7.97],[16.22, 14.87, 11.31, 10.54, 10.72]
                    ,[9.77, 7.54, 5.96, 2.75, 4.99],[18.61, 16.52, 13.52, 11.62, 16.44]], 'delta_hours_divided_by_3': [3, 1,2,3,1]}
df = pd.DataFrame(data=d)
print(df)

                            T_forecast  delta_hours_divided_by_3
0     [11.98, 10.84, 8.74, 6.31, 4.52]                         3
1      [11.29, 7.87, 3.94, 5.02, 7.97]                         1
2  [16.22, 14.87, 11.31, 10.54, 10.72]                         2
3       [9.77, 7.54, 5.96, 2.75, 4.99]                         3
4  [18.61, 16.52, 13.52, 11.62, 16.44]                         1

We need to create column with particular element from the list in 'T_forecast' columns.The result should be:

                            T_forecast  delta_hours_divided_by_3   T_by_the_shift_start
0     [11.98, 10.84, 8.74, 6.31, 4.52]                         3               8.74       
1      [11.29, 7.87, 3.94, 5.02, 7.97]                         1               11.29
2  [16.22, 14.87, 11.31, 10.54, 10.72]                         2               14.87
3       [9.77, 7.54, 5.96, 2.75, 4.99]                         3               5.96
4  [18.61, 16.52, 13.52, 11.62, 16.44]                         1               18.61

I can get it for particular value, with code:

print(df['T_forecast'][0][df['delta_hours_divided_by_3'][0]-1])

8.74

But I struggle creating column out:

df['T_by_the_shift_start']=df['T_forecast'][df['delta_hours_divided_by_3']-1]

ValueError: cannot reindex on an axis with duplicate labels

Using the for loop is not an option since the original dataframe is very large and the server will choke. What can lead to solving the issue?

CodePudding user response：

Loop solution

df['T_by_the_shift_start'] = [a[b - 1] for a, b in df.to_numpy()]

Non-loop solution

** lists should have same length across all rows

** This solution will perform around 2x better only on large data sets >= 500K

df['T_by_the_shift_start'] = np.array([*df['T_forecast']])[range(len(df)), df['delta_hours_divided_by_3'] - 1]

                            T_forecast  delta_hours_divided_by_3  T_by_the_shift_start
0     [11.98, 10.84, 8.74, 6.31, 4.52]                         3                  8.74
1      [11.29, 7.87, 3.94, 5.02, 7.97]                         1                 11.29
2  [16.22, 14.87, 11.31, 10.54, 10.72]                         2                 14.87
3       [9.77, 7.54, 5.96, 2.75, 4.99]                         3                  5.96
4  [18.61, 16.52, 13.52, 11.62, 16.44]                         1                 18.61

CodePudding user response：

here is one way to do it

df['T_by_the_shift_start']=df.apply(lambda x: x['T_forecast'][x['delta_hours_divided_by_3']-1] , axis=1) 
df

    T_forecast                          delta_hours_divided_by_3    T_by_the_shift_start
0   [11.98, 10.84, 8.74, 6.31, 4.52]        3                       8.74
1   [11.29, 7.87, 3.94, 5.02, 7.97]         1                       11.29
2   [16.22, 14.87, 11.31, 10.54, 10.72]     2                       14.87
3   [9.77, 7.54, 5.96, 2.75, 4.99]          3                       5.96
4   [18.61, 16.52, 13.52, 11.62, 16.44]     1                       18.61