I know this might be quite a general question, but I'll try. I have three huge DataFrames (around 5 million observations each) that I have to merge together, but when I run
db_cpc_id = pd.merge(df_id_appended, df_cpc_appended, how='left', on='docdb_family_id')
the kernel stops working. Any suggestions on how to avoid the kernel restarting? Maybe using pd.concat() would solve the issue?
Thank you
CodePudding user response:
The first thing to consider is that merge is memory-intensive, and you may simply not have enough RAM for this operation. Have a look at Vaex (https://vaex.io/), a fast out-of-core library for manipulating massive amounts of data. Its syntax is not identical to pandas, but it is very similar. The example below assumes you have two CSVs that you can load, join, and then store.
import vaex

# convert=True caches each CSV as HDF5 so it can be memory-mapped;
# chunk_size controls how many rows are read at a time during conversion
vaex_df1 = vaex.from_csv(file1, convert=True, chunk_size=5_000)
vaex_df2 = vaex.from_csv(file2, convert=True, chunk_size=5_000)
joined_df = vaex_df1.join(vaex_df2, how='left', on='docdb_family_id')
# store the result (the filename here is illustrative)
joined_df.export_hdf5('joined.hdf5')
Please check your system resources when running your code to get a better understanding of why your kernel is failing :)
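If you would rather stay in pandas, one memory-saving option is to merge the large left table in chunks, so that only one chunk plus the (smaller) right table is in memory at a time, and concatenate the pieces at the end. This is a minimal sketch, not your exact data: the `chunked_left_merge` helper and the tiny example frames are illustrative, with only the `docdb_family_id` key taken from the question.

```python
import pandas as pd

def chunked_left_merge(left, right, key, chunk_size=500_000):
    """Left-merge `left` onto `right` one chunk at a time to cap peak memory."""
    pieces = []
    for start in range(0, len(left), chunk_size):
        chunk = left.iloc[start:start + chunk_size]
        pieces.append(chunk.merge(right, how='left', on=key))
    return pd.concat(pieces, ignore_index=True)

# tiny illustration; a small chunk_size forces several chunks
df_id = pd.DataFrame({'docdb_family_id': [1, 2, 3, 4],
                      'id_val': ['a', 'b', 'c', 'd']})
df_cpc = pd.DataFrame({'docdb_family_id': [1, 2],
                       'cpc_val': ['x', 'y']})
merged = chunked_left_merge(df_id, df_cpc, 'docdb_family_id', chunk_size=2)
```

Downcasting numeric dtypes and converting repeated strings to `category` before merging can also cut memory use substantially.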