Pandas won't open weird TSV file-CodePudding

The TSV file here https://www.imf.org/-/media/Files/Publications/WEO/WEO-Database/2022/WEOApr2022all.ashx which comes from https://www.imf.org/en/Publications/WEO/weo-database/2022/April/download-entire-database will not open on Pandas. I have tried several things: using a separator for tabs, opening it with read_excel (the website says it's compatible with all modern systems).

import pandas as pd

path = "C:..\\WEOApr2022all.xls"

dataframe = pd.read_csv(path, sep="\\t", encoding='windows-1252')

Error: chunks = self._reader.read_low_memory(nrows) File "pandas_libs\parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory File "pandas_libs\parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows File "pandas_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows File "pandas_libs\parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 14

CodePudding user response：

Despite its file extension it is a tab delimited, UTF16 encoded text file and can be read using

dataframe = pd.read_csv(path, sep='\\t', encoding='utf_16le', engine='python', skipfooter=2)

8624 rows × 58 columns

Note: You need to specify the delimiter as a regex ('\\t' or r'\t') instead of a single character ('\t') and hence use the python engine because each lines ends with a delimiter. If you use '\t' you'll get an extra empty column at the end.

CodePudding user response：

That's a tab-separated file with a fake extension with extra problems:

it uses the UTF16-LE encoding without a BOM
it has a trailing tab at the end which should result in an empty column.
Finally, it has a one-line footer at the end

International Monetary Fund, World Economic Outlook Database, April 2022

You can read this using :

df=pd.read_table(r"C:\Users\pankan\Downloads\WEOApr2022all.xls",
                 sep='\t',
                 encoding='utf_16le',
                 usecols=lambda c: not c.startswith('Unnamed:'),
                 skipfooter=1)

df=pd.read_table(r"C:\Users\pankan\Downloads\WEOApr2022all.xls",
                 encoding='utf_16le',
                 usecols=lambda c: not c.startswith('Unnamed:'),
                 skipfooter=1)

read_csv drops empty lines by default

Some extremely lazy developers try to fake Excel files by generating a CSV or even HTML table and send it as a file with a fake .xls extension. This may fool users, but not Excel itself. Excel will recognize this is a text file and try to import it using the user's defaults for separators, number and date formats. If those don't match the developer's formats, you'll get errors.