I have a very large CSV that takes ~30 seconds to read when using the normal pd.read_csv
command. Is there a way to speed this process up? I'm thinking maybe something that only reads rows that have some matching value in one of the columns.
i.e., only read in rows where the value in column 'A' is '5'.
CodePudding user response:
The Dask module can do a lazy read of a large CSV file in Python.
You trigger the computation by calling the .compute() method; at that point the file is read in chunks and whatever conditional logic you specified is applied.
import dask.dataframe as dd

csv_file = "large_file.csv"   # path to the large CSV (placeholder name)
df = dd.read_csv(csv_file)    # lazy: nothing is read yet
df = df[df['A'] == 5]         # the filter is recorded, still lazy
df = df.compute()             # the file is read in chunks and the filter applied
print(len(df))    # print number of matching records
print(df.head())  # print first 5 rows to show a sample of the data
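Note that Dask still reads the entire file; the gain is that the blocks are processed in parallel (with the threaded scheduler by default) and only the rows matching the filter end up in the in-memory result returned by .compute().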
CodePudding user response:
If you're searching for a value in a CSV file, the whole document still has to be scanned before the output can be limited to the matching rows.
If you just want to retrieve the first few rows instead, you may be looking for the nrows parameter of pd.read_csv:
nrows : int, optional
    Number of rows of file to read. Useful for reading pieces of large files.
Reference: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
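A minimal sketch of using nrows (the file name is a placeholder):

import pandas as pd

df = pd.read_csv("large_file.csv", nrows=5)   # read only the first 5 rows
print(df)

If the goal is specifically to keep only the rows where column 'A' equals 5 without switching to Dask, pd.read_csv also accepts a chunksize argument that returns an iterator of DataFrames, so each chunk can be filtered as it is read (the chunk size below is an arbitrary choice):

import pandas as pd

chunks = pd.read_csv("large_file.csv", chunksize=100_000)   # iterator of DataFrames
df = pd.concat(chunk[chunk['A'] == 5] for chunk in chunks)  # keep only matching rows
print(len(df))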