What happens when pandas read_csv is run on a file that is too large


If a file fed into pandas read_csv is too large, will it raise an exception? What I'm afraid of is that it will just read what it can, say the first 1,000,000 rows, and proceed as if there were no problem.

Are there situations in which pandas will fail to read all records in a file but also fail to raise an exception (or print errors)?
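
One way to guard against a silent truncation is to compare the number of parsed rows against a raw line count of the file. A minimal sketch (the file name data.csv is a placeholder; quoted fields containing newlines will make the line count an over-count, so treat this as a sanity check rather than an exact comparison):

import pandas as pd

# Count physical lines in the file (header included).
with open("data.csv", "rb") as f:
    line_count = sum(1 for _ in f)

df = pd.read_csv("data.csv")
print(f"lines in file (incl. header): {line_count}, rows parsed: {len(df)}")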

CodePudding user response:

I had issues with pandas once where I tried to open a very large dataset and my kernel crashed. I eventually used PySpark. It is not hard to use and you can easily port between PySpark and Pandas.
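
A minimal sketch of that workflow, assuming a file called data.csv with a header row (adjust the read options to your data):

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("large-csv").getOrCreate()

# Spark reads the CSV lazily and in parallel, so it does not need to fit in memory.
sdf = spark.read.csv("data.csv", header=True, inferSchema=True)

print(sdf.count())  # total number of rows actually read

# Only convert a manageable subset back to pandas.
pdf = sdf.limit(100_000).toPandas()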

CodePudding user response:

If you have a large dataset and you want to read it many times, I recommend saving it as a .pkl (pickle) file.

Or you can wrap the read in a try/except block so that failures are not silent.

However, if you still want to use a CSV file, you can visit this link and find a solution: How do I read a large csv file with pandas?
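
As a rough sketch combining both ideas (the file names and chunk size below are placeholders), you can read the CSV once in chunks inside a try/except, then cache the result as a pickle for later runs:

import pandas as pd

try:
    # Read in chunks so the whole file never has to fit in memory at once.
    chunks = pd.read_csv("data.csv", chunksize=100_000)
    df = pd.concat(chunks, ignore_index=True)
except (pd.errors.ParserError, MemoryError) as exc:
    raise RuntimeError(f"Could not read data.csv completely: {exc}")

# Cache for subsequent runs; reading a pickle is much faster than re-parsing the CSV.
df.to_pickle("data.pkl")
df = pd.read_pickle("data.pkl")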

CodePudding user response:

I'd recommend using dask, a high-level library that supports parallel computing.

You can easily point it at all your data, but nothing is loaded into memory until you explicitly compute a result:

import dask.dataframe as dd

# Lazily scan the CSV; this only reads metadata, not the full file.
df = dd.read_csv('data.csv')
df

and from there, you can compute only the columns/rows you are interested in; calling .compute() materializes the selection as a regular pandas DataFrame:

# 'columns' and 'indices_to_select' are placeholders for your own selection.
df_selected = df[columns].loc[indices_to_select]

# .compute() triggers the actual (parallel) read and returns a pandas DataFrame.
df_selected.compute()