I am dealing with two huge dataframes, and I need to retrieve, for each unique id in the first one, the most frequent value associated with that id in the second one. I will explain it better with an example.
Suppose I have two dataframes; the first one will be called df_id, the second one df_values.
df_values:
   ids  values
0  id1      10
1  id2      20
2  id1      10
3  id1      30
4  id2      40
...
df_id:
   ids       desc
0  id1  a product
1  id2  a product
2  id3  a product
In the real dataframes I have additional columns, but for the sake of clarity I omitted them.
Now, df_id contains all the ids I need, while df_values contains the (possibly multiple) values associated with each id.
My goal is to create a dictionary that, for each distinct id, reports its most frequent value. If an id does not appear in df_values, I store None instead of a value.
The expected values_dict:
{'id1': 10, 'id2': 20, 'id3': None}
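For reproducibility, the toy frames above can be built like this:

import pandas as pd

df_values = pd.DataFrame({'ids': ['id1', 'id2', 'id1', 'id1', 'id2'],
                          'values': [10, 20, 10, 30, 40]})
df_id = pd.DataFrame({'ids': ['id1', 'id2', 'id3'],
                      'desc': ['a product'] * 3})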
My attempt to solve this was something like this:
from collections import Counter
import pandas as pd

def Most_Common(lst):
    # count occurrences and return the single most frequent element
    data = Counter(lst)
    return data.most_common(1)[0][0]

dict_val = {}
for ar in pd.unique(df_id['ids']):
    # rows of df_values belonging to the current id
    df_art = df_values.loc[df_values['ids'] == ar]
    print("Done", ar)
    # most frequent value, or None if the id never appears in df_values
    val = Most_Common(df_art['values']) if not df_art.empty else None
    dict_val[ar] = val
The code of the Most_Common function comes from this answer.
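For instance, on the values of id1 from the sample data:

Most_Common([10, 10, 30])  # Counter({10: 2, 30: 1}) -> returns 10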
However, this solution does not scale well when the dataframes are VERY large, which is my case. I even tried to use the multiprocessing library:
from collections import Counter
import multiprocessing as mp
import numpy as np
import pandas as pd

dict_val = {}
unique_ids = pd.unique(df_id["ids"])
# split the unique ids into 5 chunks, one per worker
unique_ids = np.array_split(unique_ids, 5)

def register_value(ids):
    # Most_Common is the helper defined above
    for ar in ids:
        df_art = df_values.loc[df_values['ids'] == ar]
        print("Done", ar)
        val = Most_Common(df_art['values']) if not df_art.empty else None
        dict_val[ar] = val

with mp.Pool(processes=5) as p:
    p.map(register_value, unique_ids)
But the code has been running for about 40 minutes... maybe I have done something wrong. Is there an efficient solution, ideally one that can also be combined with multiprocessing? Thank you.
CodePudding user response:
My suggestion:
Find the most frequent value in df_values, grouped by ids. If there is more than one most frequent value, take the first one:
most_freq = df_values.groupby('ids').agg(lambda x: pd.Series.mode(x)[0])['values'].to_dict()
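With the sample data above, this yields:

{'id1': 10, 'id2': 20}

(id3 has no rows in df_values, so it does not appear here yet.)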
Create a dictionary from the ids in df_id, with None assigned to each key:
dict_val = dict.fromkeys(df_id['ids'])
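At this point dict_val is:

{'id1': None, 'id2': None, 'id3': None}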
Update the second dictionary with the first one. Every None whose id appears in df_values is replaced with its most frequent value:
dict_val.update(most_freq)
Output:
{'id1': 10, 'id2': 20, 'id3': None}
It's only a sample, so it's hard to guess how much this improves performance. It should be much faster, as we aren't iterating over the ids one by one in Python. Please check and let me know.
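As a possible further speed-up (a sketch on my part, not benchmarked on your data): value_counts per group also picks the most frequent value, though its tie-breaking may differ from pd.Series.mode:

most_freq = (
    df_values.groupby('ids')['values']
             .agg(lambda s: s.value_counts().idxmax())  # value with the highest count in each group
             .to_dict()
)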