How to read a large file as Pandas dataframe?


I want to read a large file (4GB) as a Pandas dataframe. Since reading it with Dask directly still maxes out the CPU, I read the file as a pandas dataframe first, convert it with dask_cudf, and then convert back to a pandas dataframe.

However, my code still uses maximum CPU on Kaggle, even with the GPU accelerator switched on.

import pandas as pd
import cudf
import dask_cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)

# read on the CPU with pandas, then move the frame onto the GPU
df = pd.read_csv("../input/subtype-nt/meth_subtype_normal_tumor.csv", sep="\t", index_col=0)
ddf = dask_cudf.from_cudf(cudf.from_pandas(df), npartitions=2)
meth_sub_nt = ddf.infer_objects()

CodePudding user response:

Right now your code first loads the data with pandas and then converts it to a dask-cuDF dataframe. That's not optimal (and for a 4GB file may not even be feasible, since the whole file must first fit into host memory). Instead, use the dask_cudf.read_csv function (see docs), which reads the file directly into GPU memory:

from dask_cudf import read_csv

ddf = read_csv('example_output/foo_dask.csv')
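
Applied to the file from the question, a minimal end-to-end sketch could look like the following (the path and sep="\t" come from the question; note that dask's read_csv typically does not support index_col, so set an index afterwards if you need one, and pulling the result back with .compute().to_pandas() assumes it fits in host memory):

import dask_cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# one worker per GPU, so the read and any follow-up work run on the GPU
cluster = LocalCUDACluster()
client = Client(cluster)

# read the csv directly into a dask-cuDF dataframe; no pandas round-trip
ddf = dask_cudf.read_csv("../input/subtype-nt/meth_subtype_normal_tumor.csv", sep="\t")

# ... lazy GPU operations on ddf go here ...

# convert back to pandas only at the very end, once the result is small enough
meth_sub_nt = ddf.compute().to_pandas()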

CodePudding user response:

I have had a similar problem. After some research, I came across Vaex.

You can read about its performance here and here.

Essentially this is what you can try to do:

  1. Read the csv file with Vaex and convert it to an hdf5 file (the file format Vaex is most optimised for):

    import vaex

    # convert=True writes an hdf5 copy next to the csv; chunk_size reads the file in chunks of 5,000 rows
    vaex_df = vaex.from_csv('../input/subtype-nt/meth_subtype_normal_tumor.csv', convert=True, chunk_size=5_000)
    
  2. Open the hdf5 file with Vaex. Vaex memory-maps the file, so the data is not loaded into RAM:

    vaex_df = vaex.open('../input/subtype-nt/meth_subtype_normal_tumor.csv.hdf5')
    

Now you can perform operations on your Vaex dataframe just as you would with Pandas. It should be very fast, and you should see noticeably lower CPU and memory usage.
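
For example, a minimal sketch (the column name beta_value is hypothetical; substitute one of your own columns):

import vaex

vaex_df = vaex.open('../input/subtype-nt/meth_subtype_normal_tumor.csv.hdf5')

# expressions are lazy and evaluated out-of-core, so neither line
# below materialises the full 4GB file in RAM
subset = vaex_df[vaex_df.beta_value > 0.5]  # pandas-style boolean filtering
print(subset.beta_value.mean())             # aggregation over the filtered rows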

You can also try reading your csv file directly into a Vaex dataframe without converting it to hdf5 first. I have read that Vaex works fastest with hdf5 files, which is why I suggested the approach above.

# note: without convert=True, the csv is parsed in memory (via pandas under the hood),
# so this can be slow and memory-hungry for a 4GB file
vaex_df = vaex.from_csv('../input/subtype-nt/meth_subtype_normal_tumor.csv')