I have a large dataset that is problematic to load entirely into memory, so I decided to read it row by row, picking the desired data, applying transformations, etc., then clearing the variables and moving on to the next row.
It works fine while I am using csv.reader:
source_file = open(path_to_source_file, encoding="utf8", newline='')
spamreader = csv.reader(source_file, delimiter=";", quotechar='"')
for id, row in enumerate(spamreader):
    variable1 = row[0]
    variable2 = row[1]
    variable3 = row[2]
    variable4 = row[3]
    ...
    variable25 = row[26]
    ...
    if id % 100000 == 0: print(f"Number of processed rows: {id}")
However, I am afraid that in the future some columns may be added/removed/swapped, and addressing them by row[id] is not a "safe" option.
I was wondering whether it is possible to load the data with pandas, because a DataFrame allows addressing columns by name:
for id, row in df.iterrows():
    variable1 = row['column0']
    variable2 = row['column1']
    variable3 = row['column2']
    variable4 = row['column3']
    ...
    variable25 = row['column25']
    ...
    if id % 100000 == 0: print(f"Number of processed rows: {id}")
However, I wasn't able to make it work at a comparable speed.
I tried chunksize=1, but even a bare pass took a very long time to execute:
with pd.read_csv(path_to_source_file, sep=";", quotechar='"', chunksize=1) as reader:
    for chunk in reader:
        pass
I also tried nested loops, reading bigger chunks and then iterating over their rows:
with pd.read_csv(path_to_source_file, sep=";", quotechar='"', chunksize=10000) as reader:
    for chunk in reader:
        for id, row in chunk.iterrows():
            pass
but this is also really slow compared to the plain csv.reader version.
So, to summarize my question: is there any way to load data from a CSV file with pandas row by row at a speed comparable to csv.reader?
CodePudding user response:
The csv module also provides a DictReader class.
reader = csv.DictReader(csv_file)  # csv_file is an already opened file object
print(reader.fieldnames)
By default, column names are inferred from the first row; alternatively, you can specify them by passing a sequence to the fieldnames parameter.
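A minimal sketch of how that could look with the file and ";" delimiter from the question (path_to_source_file, the 'column0'/'column1' names and the row counter are carried over from the question as illustrative placeholders):
import csv

with open(path_to_source_file, encoding="utf8", newline='') as source_file:
    reader = csv.DictReader(source_file, delimiter=";", quotechar='"')
    for id, row in enumerate(reader):
        # each row is a dict keyed by the header names,
        # so added or swapped columns stay addressable by name
        variable1 = row['column0']
        variable2 = row['column1']
        if id % 100000 == 0:
            print(f"Number of processed rows: {id}")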
CodePudding user response:
After checking @el_oso's answer suggesting DictReader, I tested how fast it is.
To summarize my test:
- csv.reader took 1min 57s to load part of my data.
- csv.DictReader took 2min 32s to load the same data.
Because I don't want to lose performance, I made a workaround with csv.reader that builds a column-name-to-index mapping while reading the first row:
source_file = open(path_to_source_file, encoding="utf8", newline='')
spamreader = csv.reader(source_file, delimiter=";", quotechar='"')
for id, row in enumerate(spamreader):
    if id == 0:
        cols = {}
        for col_id, column in enumerate(row):
            cols[column] = col_id
    else:
        variable1 = row[cols['column0']]
        variable2 = row[cols['column1']]
        variable3 = row[cols['column2']]
        variable4 = row[cols['column3']]
        ...
        variable25 = row[cols['column24']]
        ...
        if id % 100000 == 0: print(f"Number of processed rows: {id}")
- Using this method on the same data took 1min 54s (which is similar to plain csv.reader).
However, I am still open to faster solutions if anyone has ideas.