relations = []
...
if rel_dict not in relations:
    relations.append(rel_dict)
rel_df = pd.DataFrame(relations)
This is the code I use to build a DataFrame. The following check slows it down a lot:
if rel_dict not in relations
If I define 'relations' as a set, can pd.DataFrame be built from a set? And if I also want to keep the insertion order, is that possible?
Or maybe I should define 'relations' as a dict:
from collections import OrderedDict
relations = OrderedDict()
...
if rel_dict not in relations:
    relations[rel_dict] = rel_dict
What's your suggestion?
CodePudding user response:
Two ideas:
First, use frozendict (a third-party package), which is like a Python dict but hashable and immutable. The pandas DataFrame constructor does not mind taking a set of frozendicts, so this will work:
from frozendict import frozendict  # third-party package, not in the standard library
frozen_rel_dicts = set(frozendict(rel_dict) for rel_dict in rel_dicts)
df = pd.DataFrame(frozen_rel_dicts)
This will lose the ordering. To keep it, replace the set with a dict, whose keys preserve insertion order (guaranteed from Python 3.7, and an implementation detail of CPython 3.6):
frozen_rel_dicts = dict.fromkeys(frozendict(rel_dict) for rel_dict in rel_dicts)
df = pd.DataFrame(frozen_rel_dicts.keys())
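If installing frozendict is not an option, the same idea can be sketched with the standard library only. This is a minimal sketch, not a drop-in replacement: it assumes every value in each dict is hashable, and the names record_key, unique_records and the sample data are made up for illustration. It keeps a set of hashable keys next to the list, so the membership test no longer scans the whole list.

```python
def record_key(record):
    # Hashable, order-insensitive stand-in for the dict.
    # Assumption: all values in the dict are themselves hashable.
    return frozenset(record.items())


def unique_records(records):
    """Return the records with duplicates removed, keeping first occurrences."""
    seen = set()      # hashed keys: membership test is O(1) on average
    unique = []       # the original dicts, in first-seen order
    for record in records:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical sample data standing in for the rel_dict values in the question.
rel_dicts = [
    {"source": "a", "target": "b"},
    {"source": "b", "target": "c"},
    {"target": "b", "source": "a"},  # same content as the first, different key order
]
relations = unique_records(rel_dicts)
```

The resulting list of plain dicts can be passed to pd.DataFrame(relations) exactly as in the question.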
Second, feed the whole list to pandas and deduplicate afterwards:
df = pd.DataFrame(rel_dicts).drop_duplicates(keep='first')
I am not sure which one is faster; you will have to measure both on your own data.
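One way to try it is timeit from the standard library. The sketch below compares the original list scan against a dict keyed by hashable keys. The function names and the synthetic sample are assumptions for illustration, and frozenset(record.items()) stands in for frozendict so that nothing needs to be installed. The pandas variant could be timed the same way by wrapping it in another function.

```python
import timeit


def dedupe_with_list(records):
    # Original approach: every membership test scans the list, O(n) each.
    unique = []
    for record in records:
        if record not in unique:
            unique.append(record)
    return unique


def dedupe_with_keys(records):
    # Hashed approach: dict keys preserve insertion order (Python 3.7+).
    # Assumption: all values in each record are hashable.
    keyed = {}
    for record in records:
        keyed.setdefault(frozenset(record.items()), record)
    return list(keyed.values())


# Synthetic data: 2000 records, of which 500 are distinct.
sample = [{"source": i % 500, "target": (i * 7) % 500} for i in range(2000)]

list_seconds = timeit.timeit(lambda: dedupe_with_list(sample), number=3)
keys_seconds = timeit.timeit(lambda: dedupe_with_keys(sample), number=3)
print(f"list scan: {list_seconds:.4f}s, hashed keys: {keys_seconds:.4f}s")
```

The absolute numbers depend on the machine and on how many duplicates the data contains, so it is worth running this on a realistic sample.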