May I create a dataframe from a set in Pandas?

relations = []
... 
if rel_dict not in relations:
    relations.append(rel_dict)


rel_df = pd.DataFrame(relations)

This is the code I use to create the dataframe, and the following membership check slows it down a lot as 'relations' grows:

if rel_dict not in relations:

If I define 'relations' as a set instead, can the DataFrame constructor take a set to create a dataframe? And if I also want to keep the insertion order, can I do it?

Or maybe I should define 'relations' as a dict:

from collections import OrderedDict
relations = OrderedDict()
... 
if rel_dict not in relations:
    relations[rel_dict] =  rel_dict

What's your suggestion?

CodePudding user response:

Two ideas:

First, use frozendict, which is like a Python dict but immutable and hashable (as long as its values are hashable), so it can be stored in a set. The pandas DataFrame constructor does not mind taking a set of frozendicts, so this will work:

from frozendict import frozendict  # third-party package
frozen_rel_dicts = set(frozendict(rel_dict) for rel_dict in rel_dicts)
df = pd.DataFrame(frozen_rel_dicts)

This will lose the ordering, so to keep it we can replace the set with a dict, whose keys preserve insertion order (guaranteed since Python 3.7, and an implementation detail of CPython 3.6):

frozen_rel_dicts = dict.fromkeys(frozendict(rel_dict) for rel_dict in rel_dicts)
df = pd.DataFrame(frozen_rel_dicts.keys())
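
The same order-preserving deduplication can be sketched with only the standard library, in case installing frozendict is not an option. This is a minimal sketch, not a definitive implementation: the helper name dedupe_preserving_order and the sample data are made up for illustration, and it assumes every value in each rel_dict is hashable. A set of hashable keys is kept next to the list, so each membership check is an average constant-time lookup instead of a scan of the whole list.

```python
def dedupe_preserving_order(rel_dicts):
    """Return the distinct dicts from rel_dicts, in first-seen order."""
    seen = set()    # hashable keys of the dicts collected so far
    unique = []     # distinct dicts, in insertion order
    for rel_dict in rel_dicts:
        # A frozenset of (key, value) pairs is hashable and ignores key order.
        # Assumption: every value in rel_dict is itself hashable.
        key = frozenset(rel_dict.items())
        if key not in seen:    # set lookup instead of a list scan
            seen.add(key)
            unique.append(rel_dict)
    return unique


# Hypothetical sample data, for illustration only.
rel_dicts = [
    {"source": "a", "target": "b"},
    {"source": "b", "target": "c"},
    {"source": "a", "target": "b"},  # duplicate of the first entry
]
relations = dedupe_preserving_order(rel_dicts)
```

The resulting list of plain dicts can then be passed to pd.DataFrame(relations) exactly as in the question.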

Second, feed the whole list to pandas and deduplicate afterwards:

df = pd.DataFrame(rel_dicts).drop_duplicates(keep='first')
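
A small self-contained run of this second idea might look like the sketch below. The sample data is hypothetical, and the reset_index call is an optional addition (not part of the line above) that renumbers the rows after duplicates are dropped. It assumes the cell values are hashable; as far as I know, drop_duplicates fails on cells that hold lists or dicts.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
rel_dicts = [
    {"source": "a", "target": "b"},
    {"source": "b", "target": "c"},
    {"source": "a", "target": "b"},  # duplicate of the first entry
]

df = (
    pd.DataFrame(rel_dicts)
    .drop_duplicates(keep="first")  # keep the first occurrence of each row
    .reset_index(drop=True)         # optional: renumber the remaining rows
)
```

Because keep='first' retains the earliest occurrence of each row, the original order of the distinct entries is preserved.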

I'm not sure which one is faster; you will have to time both on your data.