Pandas read_csv produces unexpected behavior, why?


I've got a large tab-separated file and am trying to load it using

import pandas as pd

df = pd.read_csv(..., sep="\t")

however, the process crashes with the following error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 1743925, saw 12

Nothing appeared to be wrong with that particular line when I printed it out manually. Confident that there is nothing wrong with my file, I went and calculated the field counts myself...

from collections import Counter 

lengths = []
with open(...) as f:
    for line in f:
        # count fields by splitting each line on tabs
        lengths.append(len(line.split('\t')))

c = Counter(lengths)
print(c)

...and got the result Counter({8: 2385674}). So I wondered what pandas does differently, but the error is raised inside a .pyx file, so I cannot set a breakpoint there. What could be the cause of this? Where is my expectation flawed?
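For comparison, a field count done with a quote-aware reader can disagree with the plain split above whenever fields contain quote characters, because tabs and newlines inside quoted fields are not treated as separators. A minimal sketch using Python's built-in csv module (the file path data.tsv is a placeholder for the same file):

import csv
from collections import Counter

with open("data.tsv", newline="") as f:
    reader = csv.reader(f, delimiter="\t", quotechar='"')
    # tabs and newlines inside quoted fields do not count as separators here
    print(Counter(len(row) for row in reader))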

CodePudding user response:

Fixed the issue. It turned out the problem was a quoting mismatch between the CSV export and the read. The issue was solved by matching the quoting passed to read_csv with the quoting used in the to_csv call that created the file. I assume some tabs and newlines were interpreted as part of quoted string values because of the mismatch, hence the 11 tab characters the parser saw on what it took to be one row (it was actually 2 rows).
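A minimal sketch of what matching the quoting can look like, assuming the original export used quoting=csv.QUOTE_NONE (the frame, column names, and file name are placeholders, not the original data):

import csv
import pandas as pd

# hypothetical frame whose text column starts a value with a literal quote
df = pd.DataFrame({"text": ['"raw value', "plain"], "n": [1, 2]})

# export without quoting: the leading '"' is written to the file as-is
df.to_csv("data.tsv", sep="\t", quoting=csv.QUOTE_NONE, index=False)

# reading back with pandas' default (QUOTE_MINIMAL) would treat that '"'
# as the start of a quoted field and could swallow tabs and newlines from
# the rows that follow; matching the export's quoting avoids this
df2 = pd.read_csv("data.tsv", sep="\t", quoting=csv.QUOTE_NONE)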
