There is a CSV file of approximately 2.5 GB with about 50 columns and 4.5 million rows.
The dataset will be used for different operations, but only a few columns are needed at a time, so I am looking for a high-performance way to read just one column from a CSV file.
Reading the whole file into a Pandas DataFrame takes roughly 38 seconds:
path = r"C:\my_path\my_csv.csv"
pd.read_csv(path, header=0)
Reading only one specific column takes about 14 seconds:
pd.read_csv(path, usecols=["my_specific_col"], header=0)
Is there a way to reduce the reading time further? It seems that the number of columns has little effect on performance.
CodePudding user response:
Since version 1.4.0, Pandas offers a new experimental engine for read_csv that relies on the Arrow library’s multithreaded CSV parser instead of the default C parser.
So, this might help to speed things up:
df = pd.read_csv(path, usecols=["my_specific_col"], header=0, engine="pyarrow")