There is a CSV file of approximately 2.5 GB with about 50 columns and 4.5 million rows.
The dataset will be used for different operations, but only a few columns are needed at a time, so I am looking for a high-performance way to read just one column from a CSV file.
Reading the whole file into a Pandas DataFrame takes roughly 38 seconds:
path = r"C:\my_path\my_csv.csv"
pd.read_csv(path, header=0)
Reading only one specific column takes about 14 seconds:
pd.read_csv(path, usecols=["my_specific_col"], header=0)
Is there a way to reduce the reading time further? It seems that the number of columns has little effect on performance.
CodePudding user response:
Since version 1.4.0, Pandas offers a new experimental engine for read_csv that relies on the Arrow library’s multithreaded CSV parser instead of the default C parser.
So, this might help to speed things up:
df = pd.read_csv(path, usecols=["my_specific_col"], header=0, engine="pyarrow")