I want to load the values from the "category" column into a pandas DataFrame. This is my TSV file:
Tagname text category
j245qzx_8 hamburger toppings f
h833uio_7 side of fries f
d423jin_2 milkshake combo d
This is my code:
with open(filename, 'r') as f:
    df = pd.read_csv(f, sep='\t')
categoryColumn = df["category"]
categoryList = []
for line in categoryColumn:
    categoryList.append(line)
However, I get a UnicodeDecodeError on the line df = pd.read_csv(f, sep='\t'), and my code stops there:
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 539, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 737, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2101, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 898: invalid start byte
Any ideas why this happens or how to fix it? There don't seem to be any special characters in my TSV, so I'm not sure what's causing this or what to do.
CodePudding user response:
The fix
I just read a related Stack Overflow answer, and I think I see what's wrong.
You're getting a file handle with Python's open() and passing that to pandas's read_csv(). Because the handle is opened in text mode, it is open(), not read_csv(), that determines the file's encoding.
So, try setting the encoding in open(), like this:
with open(filename, 'r', encoding='windows-1252') as f:
    df = pd.read_csv(f, sep='\t')
categoryColumn = df["category"]
categoryList = []
for line in categoryColumn:
    categoryList.append(line)
Or, don't use open() at all:
df = pd.read_csv(filename, sep='\t', encoding='windows-1252')
categoryColumn = df["category"]
categoryList = []
for line in categoryColumn:
    categoryList.append(line)
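If you want to sanity-check the encoding argument without touching your real file, here is a minimal sketch. It builds an in-memory stand-in for the TSV; the "10\x89" value is made up for illustration, since I don't know what your file actually contains around position 898. It also uses tolist() in place of the explicit loop, which should give the same list.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: two of the sample rows, with one
# made-up value containing byte 0x89, which is not valid UTF-8 on its own.
raw = (
    b"Tagname\ttext\tcategory\n"
    b"j245qzx_8\thamburger toppings\tf\n"
    b"h833uio_7\tside of fries 10\x89\tf\n"
)

# Passing the encoding explicitly lets pandas decode the 0x89 byte.
df = pd.read_csv(io.BytesIO(raw), sep="\t", encoding="windows-1252")

# Series.tolist() collects the column values into a plain Python list.
category_list = df["category"].tolist()
print(category_list)
print(df["text"].iloc[1])
```

In Windows-1252, byte 0x89 is the per mille sign (‰), so that is what shows up in the text column here. If the decoded text in your real file looks wrong, the encoding guess is probably wrong too.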
Some of the back story
I echoed the byte \x89 onto the end of your sample, then ran chardetect (the command-line utility that ships with Python's chardet package), and it suggests the encoding is Windows-1252:
% echo -e '\x89' >> sample.csv
% cat sample.csv
Tagname text category
j245qzx_8 hamburger toppings f
h833uio_7 side of fries f
d423jin_2 milkshake combo d
�
% which chardetect
/Library/Frameworks/Python.framework/Versions/3.9/bin/chardetect
% chardetect sample.csv
sample.csv: Windows-1252 with confidence 0.73
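Since you said you can't see any special characters, it may help to look at the raw bytes directly. The following is a rough sketch using only the standard library; find_non_utf8_bytes is a helper name I made up, and the sample bytes mimic the file I built above rather than your real data.

```python
def find_non_utf8_bytes(data: bytes, context: int = 10):
    """Return (offset, byte value, nearby bytes) for each spot where UTF-8 decoding fails."""
    hits = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as err:
            # err.start is relative to the slice, so shift it back by pos.
            offset = pos + err.start
            snippet = data[max(0, offset - context): offset + context]
            hits.append((offset, data[offset], snippet))
            pos = offset + 1  # skip the bad byte and keep scanning
    return hits


# Hypothetical sample: the last data row followed by a stray 0x89 byte.
sample = b"d423jin_2\tmilkshake combo\td\n\x89\n"
hits = find_non_utf8_bytes(sample)
for offset, value, snippet in hits:
    print(offset, hex(value), snippet)
```

For your real file, you could read it in binary mode with open(filename, 'rb') and pass the result of .read() to the helper. The offsets it reports should be comparable to the "position 898" in your traceback, and the surrounding bytes should show which row the character is hiding in.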