I have a huge database (500GB or so) and was able to load it into pandas. The database contains something like 39,705,210 observations. As you can imagine, Python has a hard time even opening it. Now I am trying to use Dask to export it to CSV in 20 partitions, like this:
import dask.dataframe as dd
dask_merge_bodytextknown5 = dd.from_pandas(merge_bodytextknown5, npartitions=20) # Dask DataFrame has 20 partitions
dask_merge_bodytextknown5.to_csv('df_complete_emakg_*.csv')
#merge_bodytextknown5.to_csv('df_complete.zip', compression={'method': 'zip', 'archive_name': 'df_complete_emakg.csv'})
However, when I try to drop some of the rows, e.g. by doing:
merge_bodytextknown5.drop(merge_bodytextknown5.index[merge_bodytextknown5['confscore'] == 3], inplace = True)
the kernel suddenly stops. So my questions are:
- is there a way to drop the desired rows using Dask (or another approach that prevents the kernel from crashing)?
- do you know a way to lighten the dataset or deal with it in Python (e.g. doing some basic descriptive statistics in parallel) other than dropping observations?
- do you know a way to export the pandas dataframe as a CSV in parallel without saving the n partitions separately (as Dask does)?
Thank you
CodePudding user response:
Dask dataframes do not support the inplace kwarg, since each partition and subsequent operations are delayed/lazy. However, just like in Pandas, it's possible to assign the result to the same dataframe:
df = merge_bodytextknown5 # this line is for easier readability
mask = df['confscore'] != 3 # note the inversion of the requirement
df = df[mask]
If there are multiple conditions, mask can be redefined, for example to test two values:
mask = ~df['confscore'].isin([3,4])
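If the conditions involve different columns, the same lazy boolean masks can be combined with & and | before filtering. A small illustrative sketch (the citations column below is hypothetical, not from the original data):

mask = (df['confscore'] != 3) & (df['citations'] > 0) # keep rows satisfying both conditions
df = df[mask] # still lazy, no computation triggered yet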
Dask will keep track of the operations but, crucially, will not launch computations until they are requested/needed. For example, the syntax to save a csv file is very much pandas-like:
df.to_csv('test.csv', index=False, single_file=True) # this saves to a single file
df.to_csv('test_*.csv', index=False) # this saves one file per dask dataframe partition
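Regarding the second question, nothing needs to be dropped just to compute summary statistics: the same lazy dataframe can feed aggregations, and only calling .compute() triggers the (parallel) work. A minimal sketch, assuming merge_bodytextknown5 is the original pandas dataframe and confscore is a numeric column:

import dask.dataframe as dd

df = dd.from_pandas(merge_bodytextknown5, npartitions=20)  # split into 20 lazy partitions
df = df[df['confscore'] != 3]                              # lazy row filter, nothing runs yet

stats = df['confscore'].describe().compute()               # descriptive statistics, evaluated in parallel across partitions
counts = df['confscore'].value_counts().compute()          # frequency table of the remaining values

df.to_csv('df_filtered.csv', index=False, single_file=True)  # write the filtered data to one CSV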