Load a column from a TSV file into a python list

I want to load the values from the "category" column of my TSV file into a Python list, going through a pandas DataFrame. This is my TSV file:

Tagname   text  category
j245qzx_8   hamburger toppings   f
h833uio_7   side of fries   f
d423jin_2   milkshake combo   d

This is my code:

with open(filename, 'r') as f:
    df = pd.read_csv(f, sep='\t')
    categoryColumn = df["category"]

    categoryList = []
    for line in categoryColumn:
        categoryList.append(line)

However, I get a UnicodeDecodeError on the line df = pd.read_csv(f, sep='\t'), and my code stops there:

File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 539, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 737, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2101, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 898: invalid start byte

Any ideas why this happens or how to fix it? There don't seem to be any special characters in my TSV, so I'm not sure what's causing this or what to do.

CodePudding user response：

The fix

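I just read a related Stack Overflow post, and I think I see what's wrong.

You're getting a file handle from Python's open() and passing that to pandas's read_csv(). In text mode, open() is what decodes the file: it uses the encoding you pass, or the platform default if you pass none. Your traceback shows that default is UTF-8, and byte 0x89 is not a valid UTF-8 start byte.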

So, try setting the encoding in open(), like this:

with open(filename, 'r', encoding='windows-1252') as f:
    df = pd.read_csv(f, sep='\t')
    categoryColumn = df["category"]

    categoryList = []
    for line in categoryColumn:
        categoryList.append(line)  # append to the list, not back onto the column

Or, don't use open() at all:

df = pd.read_csv(filename, sep='\t', encoding='windows-1252')
categoryColumn = df["category"]

categoryList = []
for line in categoryColumn:
    categoryList.append(line)  # append to the list, not back onto the column
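
If all you need is a plain Python list, the loop can probably be replaced with Series.tolist(). Here is a minimal sketch, assuming pandas is installed. The in-memory bytes stand in for your file, the fifth row containing byte 0x89 is made up so that the encoding actually matters, and the helper name load_category is hypothetical:

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: the sample rows from the question,
# plus one made-up row containing byte 0x89 so the encoding choice matters.
raw = (
    b"Tagname\ttext\tcategory\n"
    b"j245qzx_8\thamburger toppings\tf\n"
    b"h833uio_7\tside of fries\tf\n"
    b"d423jin_2\tmilkshake combo\td\n"
    b"x000aaa_0\t100\x89 juice\td\n"
)


def load_category(source, encoding="windows-1252"):
    """Read only the 'category' column and return it as a Python list."""
    df = pd.read_csv(source, sep="\t", encoding=encoding, usecols=["category"])
    return df["category"].tolist()


# read_csv accepts a path or a binary buffer; a buffer is used here so the
# sketch runs without touching the disk.
category_list = load_category(io.BytesIO(raw))
print(category_list)
```

With a real file you would pass the path instead of the BytesIO object. The encoding is still a guess, so it is worth eyeballing the "text" column afterwards to check that the odd characters look right.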

Some of the back story

I echoed the byte 0x89 onto the end of your sample, then ran chardetect (the command-line tool that ships with the chardet package), and it suggests the encoding is Windows-1252:

% echo -e '\x89' >> sample.csv

% cat sample.csv 
Tagname text    category
j245qzx_8       hamburger toppings      f
h833uio_7       side of fries   f
d423jin_2       milkshake combo d
�

% which chardetect
/Library/Frameworks/Python.framework/Versions/3.9/bin/chardetect

% chardetect sample.csv 
sample.csv: Windows-1252 with confidence 0.73
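
To see the same thing from Python, here is a small sketch using only the standard library. The sample line is made up; the only thing taken from your traceback is the byte 0x89. It shows that the byte is rejected by UTF-8 but decodes cleanly under Windows-1252:

```python
# Hypothetical line containing the offending byte from the traceback.
raw = b"milkshake combo\t100\x89\n"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    # 0x89 can only be a continuation byte in UTF-8, never a start byte.
    utf8_ok = False
    print(err)

# Under Windows-1252, 0x89 maps to U+2030 (the per mille sign).
decoded = raw.decode("windows-1252")
print(repr(decoded))
```

Keep in mind that chardetect only reports a guess (confidence 0.73 here). Other single-byte encodings such as Latin-1 would also decode the byte without raising, just to a different character, so check that the decoded text makes sense for your data.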