I've got a large tab-separated file and am trying to load it with

import pandas as pd
df = pd.read_csv(..., sep="\t")

However, the process crashes with the error

pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 1743925, saw 12
Nothing appeared wrong with that particular line when I printed it out manually. Confident that there was nothing wrong with my file, I went and counted the fields myself...
from collections import Counter
lengths = []
with open(...) as f:
    for line in f:
        lengths.append(len(line.split('\t')))
c = Counter(lengths)
print(c)
...and got the result Counter({8: 2385674}). So I was wondering what pandas does differently, but the error is raised inside a .pyx file, so I cannot set a breakpoint there. What could be the cause of this? Where is my expectation flawed?
CodePudding user response:
Fixed the issue. It turned out the quoting used on CSV export differed from the quoting used on read. The problem was solved by passing the same quoting argument to read_csv that had been used in the to_csv call that created the file. I assume some tabs and newlines were treated as part of quoted string literals because of the mismatch, hence pandas seeing 11 tab characters on one row (it was actually 2 rows).
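A minimal sketch of the mechanism using only the stdlib csv module (synthetic data, not the original file): a field containing a tab and a newline gets wrapped in quotes on write, so a quote-aware parser sees one logical row where a naive per-line tab split sees two physical lines with the wrong field counts.

```python
import csv
import io

# Write one 3-field row whose middle field contains a tab and a newline;
# the default QUOTE_MINIMAL wraps that field in quote characters.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
writer.writerow(["1", "x\ty\nz", "3"])
text = buf.getvalue()

# A quote-aware reader recovers the single 3-field row:
rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
print(len(rows), len(rows[0]))  # 1 3

# Naive splitting treats the embedded tab and newline as delimiters:
print([len(line.split("\t")) for line in text.splitlines()])  # [3, 2]
```

Matching the quoting as described above amounts to passing the same csv.QUOTE_* constant to both sides, e.g. df.to_csv(path, sep="\t", quoting=csv.QUOTE_NONE) and pd.read_csv(path, sep="\t", quoting=csv.QUOTE_NONE) — a sketch only, since QUOTE_NONE works only if no field contains a literal tab, newline, or quote character.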