I have a large dataset that is problematic to load entirely into memory, so I decided to read it row by row, picking the desired data, applying transformations, etc., then clearing the variables and moving on to the next row.
It works fine while I am using csv.reader:
source_file = open(path_to_source_file, encoding="utf8", newline='')
spamreader = csv.reader(source_file, delimiter=";", quotechar='"')
for id, row in enumerate(spamreader):
    variable1 = row[0]
    variable2 = row[1]
    variable3 = row[2]
    variable4 = row[3]
    ...
    variable25 = row[26]
    ...
    if id % 100000 == 0: print(f"Number of processed rows: {id}")
However, I am afraid that in the future some columns may be added/removed/swapped, and addressing them by row[id] is not a "safe" option.
I was wondering whether it is possible to load the data with pandas, because a DataFrame allows addressing columns by name:
for id, row in df.iterrows():
    variable1 = row['column0']
    variable2 = row['column1']
    variable3 = row['column2']
    variable4 = row['column3']
    ...
    variable25 = row['column25']
    ...
    if id % 100000 == 0: print(f"Number of processed rows: {id}")
However, I wasn't able to make it work at a comparable speed.
I tried chunksize=1, but even a bare pass took a very long time to execute:
with pd.read_csv(path_to_source_file, sep=";", quotechar='"', chunksize=1) as reader:
    for chunk in reader:
        pass
I also tried nested loops, reading bigger chunks and then iterating over their rows:
with pd.read_csv(path_to_source_file, sep=";", quotechar='"', chunksize=10000) as reader:
    for chunk in reader:
        for id, row in chunk.iterrows():
            pass
but this is also really slow compared to the plain csv.reader version.
So, to summarize my question: is there any way to load data from a CSV file with pandas row by row at a speed comparable to csv.reader?
CodePudding user response:
The csv module also provides a DictReader class.
reader = csv.DictReader(csv_file)  # csv_file is an already opened file object
print(reader.fieldnames)
By default, column names are inferred from the first row; alternatively, you can specify them by passing a sequence to the fieldnames parameter.
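A minimal sketch of how that could look with the file and ";" delimiter from the question (path_to_source_file, the 'column0'/'column1' names and the row counter are carried over from the question as illustrative placeholders):
import csv

with open(path_to_source_file, encoding="utf8", newline='') as source_file:
    reader = csv.DictReader(source_file, delimiter=";", quotechar='"')
    for id, row in enumerate(reader):
        # each row is a dict keyed by the header names,
        # so added or swapped columns stay addressable by name
        variable1 = row['column0']
        variable2 = row['column1']
        if id % 100000 == 0:
            print(f"Number of processed rows: {id}")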
CodePudding user response:
After checking @el_oso's answer suggesting DictReader, I tested how fast it is.
To summarize my test:
- csv.reader took 1min 57s to load part of my data.
- csv.DictReader took 2min 32s to load the same data.
Because I don't want to lose performance, I made a workaround with csv.reader that builds a column-name-to-index mapping while reading the first row:
source_file = open(path_to_source_file, encoding="utf8", newline='')
spamreader = csv.reader(source_file, delimiter=";", quotechar='"')
for id, row in enumerate(spamreader):
    if id == 0:
        cols = {}
        for col_id, column in enumerate(row):
            cols[column] = col_id
    else:
        variable1 = row[cols['column0']]
        variable2 = row[cols['column1']]
        variable3 = row[cols['column2']]
        variable4 = row[cols['column3']]
        ...
        variable25 = row[cols['column24']]
        ...
        if id % 100000 == 0: print(f"Number of processed rows: {id}")
- Using this method on the same data took 1min 54s (which is similar to plain csv.reader).
However, I am still open to faster solutions if anyone has ideas.