Python's `.loc` is really slow when selecting subsets of data

I have a large DataFrame df with a two-level MultiIndex (y, t) and a single value column. Currently, I select a subset via df.loc[(Y,T), :] and build a dictionary from it. The following MWE works, but the selection is very slow for large subsets.

import numpy as np
import pandas as pd

# Full DataFrame
y_max = 50
Y_max = range(1, y_max + 1)

t_max = 100
T_max = range(1, t_max + 1)

idx_max = tuple((y,t) for y in Y_max for t in T_max) 

df = pd.DataFrame(np.random.sample(y_max*t_max), index=idx_max, columns=['Value'])


# Create Dictionary of Subset of Data
y1 = 4
yN = 10
Y = range(y1, yN + 1)

t1 = 5
tN = 9
T = range(t1, tN + 1)

idx_sub = tuple((y,t) for y in Y for t in T)

data_sub = df.loc[(Y,T), :]  # This is really slow

dict_sub = dict(zip(idx_sub, data_sub['Value']))

# result, e.g. (y,t) = (5,7)
dict_sub[5,7] == df.loc[(5,7), 'Value']

I was thinking of using df.loc[(y1,t1):(yN,tN), :], but it does not work properly, because the upper bound on the second index level only applies in the final year yN.
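To make the problem with the tuple slice concrete, here is a small sketch. It is not from the original post, and it rebuilds the MWE frame with MultiIndex.from_product, which is an assumption about the intended index. A slice between two tuples is lexicographic, so it keeps every t for the years strictly between y1 and yN:

```python
import numpy as np
import pandas as pd

# Rebuild the MWE frame (assumption: a sorted two-level index named y, t)
idx = pd.MultiIndex.from_product([range(1, 51), range(1, 101)], names=["y", "t"])
df = pd.DataFrame(np.random.sample(len(idx)), index=idx, columns=["Value"])

y1, yN = 4, 10
t1, tN = 5, 9

# Lexicographic slice from (y1, t1) to (yN, tN), both ends inclusive
sliced = df.loc[(y1, t1):(yN, tN)]

wanted = (yN - y1 + 1) * (tN - t1 + 1)  # 7 years * 5 periods = 35 rows
print(len(sliced), wanted)

# Rows outside t1..tN are still present for the intermediate years
print((5, 50) in sliced.index)
```

Only the first year is cut at t1 and only the last year at tN, so the slice returns far more than the 35 rows wanted.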

Answer:

One idea is to use Index.isin with itertools.product to build a mask for boolean indexing:

from itertools import product

idx_sub = tuple(product(Y, T))

dict_sub = df.loc[df.index.isin(idx_sub), 'Value'].to_dict()
print(dict_sub)
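As a self-contained check of this approach, the sketch below rebuilds the MWE frame and compares the isin result with a per-level label slice via pd.IndexSlice. The IndexSlice variant is not part of the answer above. It is an alternative that assumes a sorted MultiIndex and contiguous ranges for Y and T. No timings are claimed here, so measure both on your own data.

```python
from itertools import product

import numpy as np
import pandas as pd

# Rebuild the MWE frame (assumption: a sorted two-level index named y, t)
idx = pd.MultiIndex.from_product([range(1, 51), range(1, 101)], names=["y", "t"])
df = pd.DataFrame(np.random.sample(len(idx)), index=idx, columns=["Value"])

y1, yN = 4, 10
t1, tN = 5, 9
idx_sub = list(product(range(y1, yN + 1), range(t1, tN + 1)))

# Approach from the answer: boolean mask built with Index.isin
dict_isin = df.loc[df.index.isin(idx_sub), "Value"].to_dict()

# Alternative (assumption: Y and T are contiguous, index is sorted):
# label slices per level, both ends inclusive
dict_slice = df.loc[pd.IndexSlice[y1:yN, t1:tN], "Value"].to_dict()

print(len(dict_isin), dict_isin == dict_slice)
```

If Y or T are not contiguous ranges, the isin version still applies, while the slice version would need lists of labels instead of slices.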