Most efficient way to read a specific column in large csv file


There is a CSV file of approximately 2.5 GB with about 50 columns and 4.5 million rows.

The dataset will be used for different operations, but only a few columns are needed at a time, so I am looking for a high-performance way to read a single column from the CSV file.

  1. Reading the whole file into a Pandas DataFrame in one go takes roughly 38 seconds:

    import pandas as pd

    path = r"C:\my_path\my_csv.csv"
    df = pd.read_csv(path, header=0)
    
  2. Reading only one specific column takes about 14 seconds:

    df = pd.read_csv(path, usecols=["my_specific_col"], header=0)

Is there a way to reduce the reading time further? It seems that the number of columns has little effect on performance.

CodePudding user response:

Since version 1.4.0, Pandas has a new experimental engine for read_csv that relies on the Arrow library's multithreaded CSV parser instead of the default single-threaded C parser.

So, this might help to speed things up:

    df = pd.read_csv(path, usecols=["my_specific_col"], header=0, engine="pyarrow")
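
Note that this requires the pyarrow package to be installed (pip install pyarrow). Since the speedup depends on your disk, CPU count, and the file's contents, the most reliable check is to time both engines on the actual file. Here is a minimal sketch, reusing the path and column name from the question:

    import time

    import pandas as pd

    path = r"C:\my_path\my_csv.csv"

    # Compare the default C parser with the multithreaded pyarrow parser
    for engine in ("c", "pyarrow"):
        start = time.perf_counter()
        df = pd.read_csv(path, usecols=["my_specific_col"], header=0, engine=engine)
        print(f"{engine}: {time.perf_counter() - start:.1f} s")

On a multi-core machine the pyarrow engine usually wins on large files, but measuring on your own data is the only way to be sure.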