pandas `.loc` is really slow when selecting subsets of data

Time:12-10

I have a large DataFrame df with a two-level MultiIndex (y, t) and a single value column. Currently, I select a subset via df.loc[(Y,T), :] and create a dictionary from it. The following MWE works, but the selection is very slow for large subsets.

import numpy as np
import pandas as pd

# Full DataFrame
y_max = 50  
Y_max = range(1, y_max + 1)

t_max = 100 
T_max = range(1, t_max + 1)

idx_max = tuple((y,t) for y in Y_max for t in T_max) 

df = pd.DataFrame(np.random.sample(y_max*t_max), index=idx_max, columns=['Value'])


# Create Dictionary of Subset of Data
y1 = 4
yN = 10
Y = range(y1, yN + 1)

t1 = 5
tN = 9
T = range(t1, tN + 1)

idx_sub = tuple((y,t) for y in Y for t in T)

data_sub = df.loc[(Y,T), :]  #This is really slow

dict_sub = dict(zip(idx_sub, data_sub['Value']))

# result, e.g. (y,t) = (5,7)
dict_sub[5,7] == df.loc[(5,7), 'Value']

I was thinking of using the slice df.loc[(y1,t1):(yN,tN), :], but it does not work as intended: the slice is lexicographic, so the lower bound t1 applies only in the first year y1 and the upper bound tN only in the final year yN.
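A minimal sketch of why a tuple slice over-selects, using a small hypothetical frame (df_small and the bounds below are illustrative, not from the MWE). On a sorted MultiIndex, slicing from one tuple to another returns every row between the two in lexicographic order, not the rectangular block of (y, t) pairs:

```python
import numpy as np
import pandas as pd

# Small frame with a sorted (y, t) MultiIndex; names and sizes are illustrative
idx = pd.MultiIndex.from_product([range(1, 6), range(1, 6)], names=["y", "t"])
df_small = pd.DataFrame({"Value": np.arange(len(idx), dtype=float)}, index=idx)

y1, yN, t1, tN = 2, 4, 2, 3

# Tuple slice: lexicographic range from (y1, t1) up to and including (yN, tN)
sliced = df_small.loc[(y1, t1):(yN, tN), :]

# Rectangular selection: every y in [y1, yN] paired with every t in [t1, tN]
wanted = pd.MultiIndex.from_product([range(y1, yN + 1), range(t1, tN + 1)])

# The slice contains rows outside the rectangle, e.g. (3, 5)
print(len(sliced), len(wanted))
print((3, 5) in sliced.index, (3, 5) in wanted)
```

The slice returns more rows than the rectangle holds, because intermediate years contribute all of their t values.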

CodePudding user response:

One idea is to use Index.isin with itertools.product for boolean indexing:

from itertools import product

idx_sub = tuple(product(Y, T))

dict_sub = df.loc[df.index.isin(idx_sub),'Value'].to_dict()
print(dict_sub)
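As a self-contained check, the sketch below rebuilds a frame like the one in the question (with named index levels "y" and "t", which is an assumption, since the original index is unnamed) and compares the isin approach with a variant that builds one boolean mask per level. The per-level variant is not part of the answer above; it is shown only as an alternative that avoids materializing every (y, t) pair. Which one is faster on a given dataset would need to be timed.

```python
from itertools import product

import numpy as np
import pandas as pd

# Rebuild a frame shaped like the MWE; level names "y" and "t" are assumed here
y_max, t_max = 50, 100
index = pd.MultiIndex.from_product(
    [range(1, y_max + 1), range(1, t_max + 1)], names=["y", "t"]
)
rng = np.random.default_rng(0)
df = pd.DataFrame({"Value": rng.random(y_max * t_max)}, index=index)

Y = range(4, 10 + 1)
T = range(5, 9 + 1)

# Approach from the answer: membership test against the explicit list of pairs
dict_isin = df.loc[df.index.isin(list(product(Y, T))), "Value"].to_dict()

# Variant: one boolean mask per index level, combined with &
y_level = df.index.get_level_values("y")
t_level = df.index.get_level_values("t")
mask = y_level.isin(list(Y)) & t_level.isin(list(T))
dict_mask = df.loc[mask, "Value"].to_dict()

# Both selections should yield the same (y, t) -> Value mapping
print(dict_isin == dict_mask)
print(dict_isin[(5, 7)] == df.loc[(5, 7), "Value"])
```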