How to make dataframe lookups faster by using a dictionary or NumPy?


I am new to data structures and would like to make my code faster, since it is part of a larger program. However, looking up values in dataframes is slowing down the search. I have two dataframes:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id1': ['A', 'B', 'C', 'D', 'J', 'K', 'H'],
                   'id2': [['K', 'J'], np.nan, ['A'], ['B', 'C'], np.nan, np.nan, ['A']],
                   'length': [2, np.nan, 1, 2, np.nan, np.nan, 1]})

where length is the number of elements in the id2 list, and

df2 = pd.DataFrame({'col1': ['B', 'C', 'A', 'K', 'J', 'A'],
                    'col2': ['D', 'D', 'C', 'A', 'A', 'H'],
                    'type': ['FF', 'SS', 'FS', 'SF', 'FS', 'FF']})

The part that is slow is:

for key in ['D']:
    # Check whether this key has a non-null id2 list
    if df.loc[df['id1'] == key, 'id2'].any():
        print('continue')
        for pred in df[df['id1'] == key]['id2'].values[0]:
            # Look up the relation type for the (pred, key) pair in df2
            reltype = df2.loc[(df2['col1'] == pred) & (df2['col2'] == key), 'type'].values[0]
            print(reltype)

What I did is convert df into a dictionary and rewrite my code as below, but I do not know how to rewrite the 'reltype' part to make it faster.

df_dic=df.groupby('id1')[['id2', 'length']].apply(lambda g: g.values.tolist()).to_dict()
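For reference, a minimal sketch of what this conversion produces, assuming the df defined above: each id1 key maps to a list of [id2, length] rows (length comes back as a float because the column contains NaN).

print(df_dic['D'])        # [[['B', 'C'], 2.0]]
print(df_dic['D'][0][0])  # ['B', 'C']  -- the id2 list
print(df_dic['D'][0][1])  # 2.0         -- the length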

for key in ['D']:
    # length is NaN when there is no id2 list for this key
    if not np.isnan(df_dic[key][0][1]):
        print('continue')
        for pred in df_dic[key][0][0]:
            reltype = df2.loc[(df2['col1'] == pred) & (df2['col2'] == key), 'type'].values[0]
            print(reltype)

My approach of converting it to a dictionary might be slow too. However, I am mainly interested in making the code faster, especially the search for 'reltype'. Thanks

CodePudding user response:

I think your first implementation should already be reasonably fast. There might be a small improvement if you don't repeat the slicing:

for key in ['D']:
    s = df.loc[df['id1'] == key, 'id2']
    if s.any():
        print('continue')
        # Slice df2 once per key instead of once per pred
        df_temp = df2.loc[df2['col2'] == key, :]
        for pred in s.values[0]:
            if not pd.isna(pred):
                reltype = df_temp.loc[df_temp['col1'] == pred, 'type'].values[0]
                print(reltype)

Depending on your data, using df_temp might be slower than using a Boolean mask for the slicing.
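For comparison, a minimal sketch of that Boolean-mask variant, assuming the same df and df2 as above: the mask is computed once per key and reused for each pred, without materializing a sub-dataframe.

for key in ['D']:
    s = df.loc[df['id1'] == key, 'id2']
    if s.any():
        print('continue')
        # Precompute the col2 mask once; combine it with a fresh col1
        # mask per pred instead of allocating df_temp
        col2_mask = df2['col2'] == key
        for pred in s.values[0]:
            if not pd.isna(pred):
                reltype = df2.loc[col2_mask & (df2['col1'] == pred), 'type'].values[0]
                print(reltype)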

CodePudding user response:

The goal is to minimize the creation of new dataframes (new memory allocations for arrays of each column) with each operation.

You can

  1. convert your keys to lookup into a set (hashset)
  2. convert your df2 into a lookup dictionary (hashmap)

Like this:

keys_to_search = {'D'}
lookup_dict = {(row.col1, row.col2): row.type for row in df2.itertuples()}

# Linear scan to search for rows to process and process them
for row in df.itertuples():
    if row.id1 in keys_to_search:
        # Lookup the answer from your hashmap for each id2 in the list
        for id2 in row.id2:
            result = lookup_dict[(id2, row.id1)]
            print(result)
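One caveat, assuming the df defined in the question: rows whose id2 is NaN would make the inner loop fail (NaN is not iterable), and pairs absent from df2 would raise a KeyError. A hedged sketch with guards added:

for row in df.itertuples():
    if row.id1 in keys_to_search:
        # Skip rows whose id2 is not a list (e.g. NaN placeholders)
        if not isinstance(row.id2, list):
            continue
        for id2 in row.id2:
            # .get avoids a KeyError for pairs missing from df2
            result = lookup_dict.get((id2, row.id1))
            if result is not None:
                print(result)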

Just to show that the memory allocations are the operations slowing you down:

keys_to_search = {'D'}
lookup_dict = {(row.col1, row.col2): row.type for row in df2.itertuples()}

# 1x memory allocation for the new sub-dataframe
df_to_process = df[df['id1'].isin(keys_to_search)]

for row in df_to_process.itertuples():
    for id2 in row.id2:
        result = lookup_dict[(id2, row.id1)]
        print(result)

This gives the comparison times:

# First method without new memory allocation for any sub-dataframes
543 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# The one with 1x memory allocation for the sub-dataframe
1.06 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Even a single extra memory allocation slows things down compared to the first method. Both are still faster than the original implementation with its repeated dataframe lookups.

# Original implementation
1.53 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
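If you want to reproduce measurements like these, a minimal sketch using the standard timeit module (assuming df, keys_to_search, and lookup_dict are defined as above; the print calls are dropped so I/O doesn't dominate, and the exact numbers will vary with your machine and data):

import timeit

def linear_scan():
    # First method: scan df once, resolve each pair via the hashmap
    for row in df.itertuples():
        if row.id1 in keys_to_search:
            for id2 in row.id2:
                lookup_dict[(id2, row.id1)]

# Average seconds per call over 1000 runs
print(timeit.timeit(linear_scan, number=1000) / 1000)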