I am new to data structures and I would like to make my code faster, since it is part of a bigger program. However, looking up values in dataframes is slowing down the search. I have two dataframes:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id1': ['A', 'B', 'C', 'D', 'J', 'K', 'H'],
                   'id2': [['K', 'J'], 'NaN', ['A'], ['B', 'C'], 'NaN', 'NaN', ['A']],
                   'length': [2, np.nan, 1, 2, np.nan, np.nan, 1]})
where length is the number of elements in the list in id2, and
df2 = pd.DataFrame({'col1': ['B', 'C', 'A', 'K', 'J', 'A'],
                    'col2': ['D', 'D', 'C', 'A', 'A', 'H'],
                    'type': ['FF', 'SS', 'FS', 'SF', 'FS', 'FF']})
The part that is slow is:
for key in ['D']:
    if df.loc[df['id1'] == key, 'id2'].any():
        print('continue')
        for pred in df[df['id1'] == key]['id2'].values[0]:
            reltype = df2.loc[(df2['col1'] == pred) & (df2['col2'] == key), 'type'].values[0]
            print(reltype)
What I did was convert df into a dictionary and rewrite my code as below, but I do not know how to rewrite the 'reltype' part and make it faster.
df_dic = df.groupby('id1')[['id2', 'length']].apply(lambda g: g.values.tolist()).to_dict()

for key in ['D']:
    if ~np.isnan(df_dic[key][0][1]):
        print('continue')
        for pred in df_dic[key][0][0]:
            reltype = df2.loc[(df2['col1'] == pred) & (df2['col2'] == key), 'type'].values[0]
            print(reltype)
Converting to a dictionary this way might be slow too, though. In any case, I am mainly interested in making the search for 'reltype' faster. Thanks
CodePudding user response:
I think your first implementation should already be fast. There might be a small improvement if you don't repeat the slicing:
for key in ['D']:
    s = df.loc[df['id1'] == key, 'id2']
    if s.any():
        print('continue')
        df_temp = df2.loc[df2['col2'] == key, :]
        for pred in s.values[0]:
            if pd.notna(pred):  # a != comparison against NaN is always True, so check with pd.notna
                reltype = df_temp.loc[df_temp['col1'] == pred, 'type'].values[0]
                print(reltype)
Depending on your data, using df_temp might be slower than slicing with a Boolean mask.
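For reference, a minimal sketch of the Boolean-mask variant, using the same df and df2 as above (the mask over col2 is built once per key and reused for every pred):

for key in ['D']:
    s = df.loc[df['id1'] == key, 'id2']
    if s.any():
        print('continue')
        # reuse one Boolean mask over df2 instead of materializing df_temp
        col2_mask = df2['col2'] == key
        for pred in s.values[0]:
            if pd.notna(pred):
                reltype = df2.loc[col2_mask & (df2['col1'] == pred), 'type'].values[0]
                print(reltype)

Whether this beats df_temp depends on the sizes involved, so it is worth timing both on your real data.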
CodePudding user response:
The goal is to minimize the creation of new dataframes (new memory allocations for arrays of each column) with each operation.
You can

- convert the keys you want to look up into a set (hashset)
- convert your df2 into a lookup dictionary (hashmap)

Like this:
keys_to_search = {'D'}
lookup_dict = {(row.col1, row.col2): row.type for row in df2.itertuples()}

# Linear scan to search for rows to process and process them
for row in df.itertuples():
    if row.id1 in keys_to_search:
        # Look up the answer in your hashmap for each id2 in the list
        for id2 in row.id2:
            result = lookup_dict[(id2, row.id1)]
            print(result)
Just to show that the memory allocations are what is slowing you down:
keys_to_search = {'D'}
lookup_dict = {(row.col1, row.col2): row.type for row in df2.itertuples()}

# 1x memory allocation for the new sub-dataframe
df_to_process = df[df['id1'].isin(keys_to_search)]

for row in df_to_process.itertuples():
    for id2 in row.id2:
        result = lookup_dict[(id2, row.id1)]
        print(result)
Gives the comparison times:
# First method without new memory allocation for any sub-dataframes
543 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# The one with 1x memory allocation for the sub-dataframe
1.06 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Even a single extra memory allocation slows you down compared to the first method without one. Both are still faster than the original implementation, which involves multiple lookups.
# Original implementation
1.53 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
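If you want to reproduce numbers of this kind on your own data, one way is to wrap each approach in a function and time it with IPython's %timeit. This is only a sketch; the function name and the choice to collect results instead of printing them are mine, not part of the measurement setup used for the timings above:

def lookup_with_hashmap(df, df2, keys_to_search):
    # build the hashmap once, then scan df without creating any sub-dataframe
    lookup_dict = {(row.col1, row.col2): row.type for row in df2.itertuples()}
    results = []
    for row in df.itertuples():
        if row.id1 in keys_to_search:
            for id2 in row.id2:
                results.append(lookup_dict[(id2, row.id1)])
    return results

# in IPython / Jupyter:
# %timeit lookup_with_hashmap(df, df2, {'D'})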