I run association rule mining with the efficient_apriori
package in Python and need to save the results. My idea is:
- convert rules into pandas df
- save df to hdf5.
The problem is that I have 4 billion rules, and the following code crashes with signal 9 (SIGKILL):
data = [dict(**rule.__dict__, confidence=rule.confidence, support=rule.support) for rule in rules]  # HERE MY CODE CRASHES!!!!!
vystup_df = pd.DataFrame(data)
ROZDELENI = 11
n = math.ceil(len(vystup_df) / ROZDELENI)
for i in range(ROZDELENI):
    uloz = vystup_df.loc[i*n : i*n + n]
    uloz.to_hdf('rules.h5', key='uloz', mode='a')
How can I improve my code, please? I think I need to save the results in small parts, but I don't know how. Thanks.
CodePudding user response:
You can save your file in chunks instead of trying to save it all at once. Assuming you can slice rules as if it were a list:
chunksize = 1000
# process the rules in slices small enough to fit in memory
for start_index in range(0, len(rules), chunksize):
    rules_slice = rules[start_index : start_index + chunksize]
    data = [dict(**rule.__dict__, confidence=rule.confidence, support=rule.support) for rule in rules_slice]
    vystup_df = pd.DataFrame(data)
    ROZDELENI = 11
    n = math.ceil(len(vystup_df) / ROZDELENI)
    for i in range(ROZDELENI):
        uloz = vystup_df.loc[i*n : i*n + n]
        uloz.to_hdf('rules.h5', key='uloz', mode='a')
The key here is that you can change chunksize to a size your machine can handle, so you shouldn't run out of memory. The drawback is that the smaller the chunk size, the longer it will take.
I'm not an expert on HDF, so there might be some tweaking to do in order to append the new DataFrames to the existing file (although from a cursory search it seems quite straightforward).
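In case it helps, here is a minimal, untested sketch of that tweak: as far as I know, passing format='table' and append=True to to_hdf makes pandas add rows to the existing key instead of replacing it, and a table-format file can later be read back in pieces. (If I remember correctly, the itemset columns coming from rule.__dict__ are tuples, so they may need converting to strings before PyTables will accept them.)

# inside the inner loop above, append each slice instead of overwriting the key
uloz.to_hdf('rules.h5', key='uloz', mode='a', format='table', append=True)

# later, the saved table can be read back chunk by chunk as well
import pandas as pd
for piece in pd.read_hdf('rules.h5', key='uloz', chunksize=1_000_000):
    handle(piece)  # 'handle' is just a placeholder for whatever you do with each chunk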
The other solution, of course, is to buy more RAM.