Efficient way to manipulate values of a pandas dataframe

I am dealing with two huge dataframes and I need, for each unique id in the first one, to retrieve the most frequent value associated with it in the second one. I will explain it better with an example.

Suppose I have two dataframes: the first one will be called df_id, the second one df_values.

df_values:

    ids values
0   id1 10
1   id2 20
2   id1 10
3   id1 30
4   id2 40
...

df_id:

    ids desc
0   id1 a product
1   id2 a product
2   id3 a product

In the real dataframes I have additional columns, but for the sake of clarity I omitted them.

Now, df_id contains references to all the ids I need, while df_values contains the (multiple) values associated with each id.

My goal is to create a dictionary that, for each distinct id, reports its most frequent value. If an id does not appear in df_values at all, I write None instead of a value.

values_dict

{'id1': 10, 'id2': 20, 'id3': None}
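
For reference, the two sample frames can be reproduced with the snippet below (column names and values taken from the listings above):

import pandas as pd

df_values = pd.DataFrame({
    'ids':    ['id1', 'id2', 'id1', 'id1', 'id2'],
    'values': [10, 20, 10, 30, 40],
})
df_id = pd.DataFrame({
    'ids':  ['id1', 'id2', 'id3'],
    'desc': ['a product'] * 3,
})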

My attempt to solve this was something like this:

from collections import Counter
import numpy as np
import pandas as pd

def Most_Common(lst):
    # Count occurrences and return the single most frequent element
    data = Counter(lst)
    return data.most_common(1)[0][0]

dict_val = {}

for ar in pd.unique(df_id['ids']):
    # Filter df_values down to the rows for this id (a full scan per id)
    df_art = df_values.loc[df_values['ids'] == ar]
    print("Done", ar)
    val = Most_Common(df_art['values']) if not df_art.empty else None
    dict_val[ar] = val

The Most_Common function is taken from this answer.

However, this solution does not scale well when my dataframes are VERY large, which is my case. I even tried the multiprocessing library:

from collections import Counter
import multiprocessing as mp
import numpy as np
import pandas as pd

unique_ids = pd.unique(df_id["ids"])
id_chunks = np.array_split(unique_ids, 5)

def register_value(ids):
    # Worker processes do not share memory with the parent, so mutating a
    # global dict here would silently discard the results: return them instead.
    out = {}
    for ar in ids:
        df_art = df_values.loc[df_values['ids'] == ar]
        print("Done", ar)
        out[ar] = Most_Common(df_art['values']) if not df_art.empty else None
    return out

dict_val = {}
with mp.Pool(processes=5) as p:
    for partial in p.map(register_value, id_chunks):
        dict_val.update(partial)

But the code has now been running for about 40 minutes... maybe I have done something wrong. Is there an efficient solution, one that could also be combined with multiprocessing? Thank you.

CodePudding user response:

My suggestion:

  1. Find the most frequent value in df_values, grouped by ids. If there is more than one most frequent value, take the first one:

    most_freq = df_values.groupby('ids')['values'].agg(lambda x: x.mode()[0]).to_dict()
    
  2. Create a dictionary from the ids in df_id, assigning None to each key:

    dict_val = dict.fromkeys(df_id['ids'])
    
  3. Update the second dictionary with the first one. Every id that appears in df_values has its None replaced with its most frequent value; the rest stay None:

    dict_val.update(most_freq)
    

Output:

{'id1': 10, 'id2': 20, 'id3': None}
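
Put together, the whole thing is three lines (run here against the sample frames above):

most_freq = df_values.groupby('ids')['values'].agg(lambda x: x.mode()[0]).to_dict()
dict_val = dict.fromkeys(df_id['ids'])  # {'id1': None, 'id2': None, 'id3': None}
dict_val.update(most_freq)
print(dict_val)  # {'id1': 10, 'id2': 20, 'id3': None}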

It's only a sample, so it's hard to say how much this will improve performance, but it should be much faster: the grouping happens inside pandas instead of a Python loop that rescans df_values once per id. Please check and let me know.
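
If the Python-level lambda in step 1 is still too slow on a very large df_values, a fully vectorized variant of the same idea is to count (id, value) pairs once and keep the top pair per id. Note the tie-breaking differs from mode() (it keeps whichever pair happens to sort first by count, not the smallest value), so treat this as a sketch:

# Count each (ids, values) pair, sort by count, keep the top pair per id
counts = (
    df_values.groupby(['ids', 'values'])
    .size()
    .reset_index(name='n')
    .sort_values('n', ascending=False)
)
most_freq = counts.drop_duplicates('ids').set_index('ids')['values'].to_dict()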
