I am new to data structures and I would like to make my code faster, since it is part of a bigger program. However, looking up values in dataframes is slowing down the search. I have two dataframes:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id1': ['A', 'B', 'C', 'D', 'J', 'K', 'H'],
                   'id2': [['K', 'J'], 'NaN', ['A'], ['B', 'C'], 'NaN', 'NaN', ['A']],
                   'length': [2, np.nan, 1, 2, np.nan, np.nan, 1]})
where length is the number of elements in the list in id2, and
df2 = pd.DataFrame({'col1': ['B', 'C', 'A', 'K', 'J', 'A'],
                    'col2': ['D', 'D', 'C', 'A', 'A', 'H'],
                    'type': ['FF', 'SS', 'FS', 'SF', 'FS', 'FF']})
The part that is slow is:
for key in ['D']:
    if df.loc[df['id1'] == key, 'id2'].any():
        print('continue')
        for pred in df[df['id1'] == key]['id2'].values[0]:
            reltype = df2.loc[(df2['col1'] == pred) & (df2['col2'] == key), 'type'].values[0]
            print(reltype)
What I did was convert df into a dictionary and rewrite my code as below, but I do not know how to rewrite the 'reltype' part and make it faster.
df_dic = df.groupby('id1')[['id2', 'length']].apply(lambda g: g.values.tolist()).to_dict()

for key in ['D']:
    if ~np.isnan(df_dic[key][0][1]):
        print('continue')
        for pred in df_dic[key][0][0]:
            reltype = df2.loc[(df2['col1'] == pred) & (df2['col2'] == key), 'type'].values[0]
            print(reltype)
Converting to a dictionary this way might be slow too, though. In any case, I am mainly interested in making the search for 'reltype' faster. Thanks
CodePudding user response:
I think your first implementation should already be fast. There might be a small improvement if you don't repeat the slicing:
for key in ['D']:
    s = df.loc[df['id1'] == key, 'id2']
    if s.any():
        print('continue')
        df_temp = df2.loc[df2['col2'] == key, :]
        for pred in s.values[0]:
            if pd.notna(pred):  # a != comparison against NaN is always True, so check with pd.notna
                reltype = df_temp.loc[df_temp['col1'] == pred, 'type'].values[0]
                print(reltype)
Depending on your data, using df_temp might be slower than slicing with a Boolean mask.
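For reference, a minimal sketch of the Boolean-mask variant, using the same df and df2 as above (the mask over col2 is built once per key and reused for every pred):

for key in ['D']:
    s = df.loc[df['id1'] == key, 'id2']
    if s.any():
        print('continue')
        # reuse one Boolean mask over df2 instead of materializing df_temp
        col2_mask = df2['col2'] == key
        for pred in s.values[0]:
            if pd.notna(pred):
                reltype = df2.loc[col2_mask & (df2['col1'] == pred), 'type'].values[0]
                print(reltype)

Whether this beats df_temp depends on the sizes involved, so it is worth timing both on your real data.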
CodePudding user response:
The goal is to minimize the creation of new dataframes (new memory allocations for arrays of each column) with each operation.
You can

- convert the keys you want to look up into a set (hashset)
- convert your df2 into a lookup dictionary (hashmap)

Like this:
keys_to_search = {'D'}
lookup_dict = {(row.col1, row.col2): row.type for row in df2.itertuples()}

# Linear scan to search for rows to process and process them
for row in df.itertuples():
    if row.id1 in keys_to_search:
        # Look up the answer in your hashmap for each id2 in the list
        for id2 in row.id2:
            result = lookup_dict[(id2, row.id1)]
            print(result)
Just to show that the memory allocations are what is slowing you down:
keys_to_search = {'D'}
lookup_dict = {(row.col1, row.col2): row.type for row in df2.itertuples()}

# 1x memory allocation for the new sub-dataframe
df_to_process = df[df['id1'].isin(keys_to_search)]

for row in df_to_process.itertuples():
    for id2 in row.id2:
        result = lookup_dict[(id2, row.id1)]
        print(result)
Gives the comparison times:
# First method without new memory allocation for any sub-dataframes
543 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# The one with 1x memory allocation for the sub-dataframe
1.06 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Even a single extra memory allocation slows you down compared to the first method without one. Both are still faster than the original implementation, which involves multiple lookups.
# Original implementation
1.53 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
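If you want to reproduce numbers of this kind on your own data, one way is to wrap each approach in a function and time it with IPython's %timeit. This is only a sketch; the function name and the choice to collect results instead of printing them are mine, not part of the measurement setup used for the timings above:

def lookup_with_hashmap(df, df2, keys_to_search):
    # build the hashmap once, then scan df without creating any sub-dataframe
    lookup_dict = {(row.col1, row.col2): row.type for row in df2.itertuples()}
    results = []
    for row in df.itertuples():
        if row.id1 in keys_to_search:
            for id2 in row.id2:
                results.append(lookup_dict[(id2, row.id1)])
    return results

# in IPython / Jupyter:
# %timeit lookup_with_hashmap(df, df2, {'D'})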