I have a very large CSV that takes ~30 seconds to read when using the normal pd.read_csv
command. Is there a way to speed this process up? I'm thinking maybe something that only reads rows that have some matching value in one of the columns.
i.e., only read in rows where the value in column 'A' is '5'.
CodePudding user response:
The Dask module can do a lazy read of a large CSV file in Python.
You trigger the computation by calling the .compute() method; at that point the file is read in chunks and whatever conditional logic you specified is applied.
import dask.dataframe as dd

csv_file = "large_file.csv"   # path to the large CSV (placeholder name)
df = dd.read_csv(csv_file)    # lazy: nothing is read yet
df = df[df['A'] == 5]         # the filter is recorded, still lazy
df = df.compute()             # the file is read in chunks and the filter applied
print(len(df))    # print number of matching records
print(df.head())  # print first 5 rows to show a sample of the data
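Note that Dask still reads the entire file; the gain is that the blocks are processed in parallel (with the threaded scheduler by default) and only the rows matching the filter end up in the in-memory result returned by .compute().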
CodePudding user response:
If you're searching for a value in a CSV file, the whole document still has to be scanned before the output can be limited to the matching rows.
If you just want to retrieve the first few rows instead, you may be looking for the nrows parameter of pd.read_csv:
nrows : int, optional
    Number of rows of file to read. Useful for reading pieces of large files.
Reference: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
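A minimal sketch of using nrows (the file name is a placeholder):

import pandas as pd

df = pd.read_csv("large_file.csv", nrows=5)   # read only the first 5 rows
print(df)

If the goal is specifically to keep only the rows where column 'A' equals 5 without switching to Dask, pd.read_csv also accepts a chunksize argument that returns an iterator of DataFrames, so each chunk can be filtered as it is read (the chunk size below is an arbitrary choice):

import pandas as pd

chunks = pd.read_csv("large_file.csv", chunksize=100_000)   # iterator of DataFrames
df = pd.concat(chunk[chunk['A'] == 5] for chunk in chunks)  # keep only matching rows
print(len(df))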