I want to read a large file (4GB) as a Pandas dataframe. Since using Dask directly still consumes maximum CPU, I read the file as a pandas dataframe, convert it to a dask_cudf dataframe, and then convert back to a pandas dataframe.
However, my code is still using maximum CPU on Kaggle. GPU accelerator is switched on.
import pandas as pd
import dask_cudf
from dask import dataframe as dd
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
cluster = LocalCUDACluster()
client = Client(cluster)
df = pd.read_csv("../input/subtype-nt/meth_subtype_normal_tumor.csv", sep="\t", index_col=0)
ddf = dask_cudf.from_cudf(df, npartitions=2)
meth_sub_nt = ddf.infer_objects()
CodePudding user response:
Right now your code suggests that you first attempt to load the data using pandas and then convert it to a dask-cuDF dataframe. That's not optimal (and might not even be feasible). Instead, one can use the dask_cudf.read_csv function (see docs):
from dask_cudf import read_csv
# read the csv directly on the GPU; sep="\t" matches the original pandas call
ddf = read_csv("../input/subtype-nt/meth_subtype_normal_tumor.csv", sep="\t")
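If you ultimately need a plain pandas dataframe again (as in your original workflow), you can materialise the result once it is small enough to fit in host memory. A minimal sketch, assuming the standard dask-cuDF behaviour that compute() returns a cudf.DataFrame, which has a to_pandas() method:

# compute() gathers the partitions into a single cudf.DataFrame on the GPU;
# to_pandas() then copies that frame back to host memory as a pandas.DataFrame
df = ddf.compute().to_pandas()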
CodePudding user response:
I have had a similar problem. With some research, I came to know about Vaex. You can read about its performance here and here.
Essentially, this is what you can try to do:
1. Read the csv file using Vaex and convert it to an hdf5 file (the file format most optimised for Vaex):

import vaex
vaex_df = vaex.from_csv('../input/subtype-nt/meth_subtype_normal_tumor.csv', convert=True, chunk_size=5_000)

2. Open the hdf5 file using Vaex. Vaex will memory-map the file and thus will not load the data into memory:

vaex_df = vaex.open('../input/subtype-nt/meth_subtype_normal_tumor.csv.hdf5')
Now you can perform operations on your Vaex dataframe just like you would with Pandas. Because the data is memory-mapped rather than loaded, these operations will be fast and you should notice significantly lower CPU and memory usage.
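For example, a minimal sketch (the column name some_column is hypothetical; substitute a column from your file):

# aggregations are evaluated out-of-core over the memory-mapped file
mean_value = vaex_df.mean(vaex_df.some_column)
# if you ultimately need a pandas dataframe, Vaex can materialise one
pandas_df = vaex_df.to_pandas_df()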
You can also try to read your csv file directly into a Vaex dataframe without converting it to hdf5 first. I had read somewhere that Vaex works fastest with hdf5 files, which is why I suggested the approach above.

vaex_df = vaex.from_csv('../input/subtype-nt/meth_subtype_normal_tumor.csv', chunk_size=5_000)