Home > Back-end >  Reading a .xls file into a DataFrame using read_html creating a list, not a df
Reading a .xls file into a DataFrame using read_html creating a list, not a df

Time:08-22

At first I tried reading the .xls file like this:

df = pd.read_excel('/Users/KLG/regionlist1.xls')

Which led to this error: ValueError: Excel file format cannot be determined, you must specify an engine manually.

Then I tried this after reading the read_excel documentation:

df = pd.read_excel('/Users/KLG/regionlist1.xls', engine='xlrd')

Which led to this error: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<table c'

When opening the file in a text editor, I realized that the file is not actually corrupted, but rather html stored in an excel file somehow:

<table cellspacing="0" rules="all" border="1" id="ctl00_ContentPlaceHolder1_dg0" style="border-collapse:collapse;">
<tr>
    <td>Licensee Name</td><td>Program Number</td><td>Program Name</td><td>Program Address 1</td><td>Program Address 2</td><td>Town</td><td>Program Zip</td><td>Region</td><td>Description</td><td>Program Phone</td><td>License Expiration Date</td><td>License First Issue Date</td><td>License Number</td><td>Capacity</td><td>CCRR Name</td><td>CCRR Phone</td><td>Infant Capacity</td><td>Toddler Capacity</td><td>Preschool Capacity</td><td>Kindergarten Capacity</td><td>SchoolAge Capacity</td><td>Toddler and Preschool MixedGroup Capacity</td><td>Infant and Toddler MixedGroup Capacity</td><td>Kindergarten and SchoolAge MixedGroup Capacity</td><td>Preschool and SchoolAge MixedGroup Capacity</td><td>Limited School Age Capacity</td><td>Major Town Code</td><td>Program Type</td><td>WebSiteUrl</td>
</tr><tr>

df = pd.read_html('/Users/KLG/regionlist1.xls')

df is now a list object that looks like this (this also means that df.head() would throw an AttributeError):

df

Is this an issue with the encoding parameters for read_html? If it is, I don't know which encoding issue to use, or if this is an entirely different issue. Any help is appreciated!

CodePudding user response:

Since Michael Delgado mentioned that pd.read_html returns a list of data frames, I tried this and it worked for me:

df = pd.read_html('/Users/KLG/regionlist1.xls', header=0)

df[0].head()
  • Related