At first I tried reading the .xls file like this:
df = pd.read_excel('/Users/KLG/regionlist1.xls')
Which led to this error: ValueError: Excel file format cannot be determined, you must specify an engine manually.
Then I tried this after reading the read_excel
documentation:
df = pd.read_excel('/Users/KLG/regionlist1.xls', engine='xlrd')
Which led to this error: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<table c'
When opening the file in a text editor, I realized that the file is not actually corrupted, but rather html stored in an excel file somehow:
<table cellspacing="0" rules="all" border="1" id="ctl00_ContentPlaceHolder1_dg0" style="border-collapse:collapse;">
<tr>
<td>Licensee Name</td><td>Program Number</td><td>Program Name</td><td>Program Address 1</td><td>Program Address 2</td><td>Town</td><td>Program Zip</td><td>Region</td><td>Description</td><td>Program Phone</td><td>License Expiration Date</td><td>License First Issue Date</td><td>License Number</td><td>Capacity</td><td>CCRR Name</td><td>CCRR Phone</td><td>Infant Capacity</td><td>Toddler Capacity</td><td>Preschool Capacity</td><td>Kindergarten Capacity</td><td>SchoolAge Capacity</td><td>Toddler and Preschool MixedGroup Capacity</td><td>Infant and Toddler MixedGroup Capacity</td><td>Kindergarten and SchoolAge MixedGroup Capacity</td><td>Preschool and SchoolAge MixedGroup Capacity</td><td>Limited School Age Capacity</td><td>Major Town Code</td><td>Program Type</td><td>WebSiteUrl</td>
</tr><tr>
df = pd.read_html('/Users/KLG/regionlist1.xls')
df
is now a list
object that looks like this (this also means that df.head()
would throw an AttributeError
):
Is this an issue with the encoding parameters for read_html
? If it is, I don't know which encoding issue to use, or if this is an entirely different issue. Any help is appreciated!
CodePudding user response:
Since Michael Delgado mentioned that pd.read_html
returns a list of data frames, I tried this and it worked for me:
df = pd.read_html('/Users/KLG/regionlist1.xls', header=0)
df[0].head()