The TSV file here https://www.imf.org/-/media/Files/Publications/WEO/WEO-Database/2022/WEOApr2022all.ashx which comes from https://www.imf.org/en/Publications/WEO/weo-database/2022/April/download-entire-database will not open on Pandas. I have tried several things: using a separator for tabs, opening it with read_excel (the website says it's compatible with all modern systems).
import pandas as pd
path = "C:..\\WEOApr2022all.xls"
dataframe = pd.read_csv(path, sep="\\t", encoding='windows-1252')
Error: chunks = self._reader.read_low_memory(nrows) File "pandas_libs\parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory File "pandas_libs\parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows File "pandas_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows File "pandas_libs\parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 14
CodePudding user response:
Despite its file extension it is a tab delimited, UTF16 encoded text file and can be read using
dataframe = pd.read_csv(path, sep='\\t', encoding='utf_16le', engine='python', skipfooter=2)
8624 rows × 58 columns
Note: You need to specify the delimiter as a regex ('\\t
' or r'\t'
) instead of a single character ('\t'
) and hence use the python
engine because each lines ends with a delimiter. If you use '\t'
you'll get an extra empty column at the end.
CodePudding user response:
That's a tab-separated file with a fake extension with extra problems:
- it uses the UTF16-LE encoding without a BOM
- it has a trailing tab at the end which should result in an empty column.
- Finally, it has a one-line footer at the end
International Monetary Fund, World Economic Outlook Database, April 2022
You can read this using :
df=pd.read_table(r"C:\Users\pankan\Downloads\WEOApr2022all.xls",
sep='\t',
encoding='utf_16le',
usecols=lambda c: not c.startswith('Unnamed:'),
skipfooter=1)
or
df=pd.read_table(r"C:\Users\pankan\Downloads\WEOApr2022all.xls",
encoding='utf_16le',
usecols=lambda c: not c.startswith('Unnamed:'),
skipfooter=1)
read_csv
drops empty lines by default
Some extremely lazy developers try to fake Excel files by generating a CSV or even HTML table and send it as a file with a fake .xls extension. This may fool users, but not Excel itself. Excel will recognize this is a text file and try to import it using the user's defaults for separators, number and date formats. If those don't match the developer's formats, you'll get errors.