HTML: Detect Table Reading Direction using BeautifulSoup or pandas?

Time:08-11

In HTML there are two types of table: horizontal and vertical. Is there a way to detect the type of a table in Python?

Maybe this can be done using pandas or BeautifulSoup?

<h2>Horizontal Headings:</h2>

<table style="width:100%">
  <tr>
    <th>Name</th>
    <th>Telephone</th>
    <th>Telephone</th>
  </tr>
  <tr>
    <td>Bill Gates</td>
    <td>555 77 854</td>
    <td>555 77 855</td>
  </tr>
</table>

<h2>Vertical Headings:</h2>

<table style="width:100%">
  <tr>
    <th>Name:</th>
    <td>Bill Gates</td>
  </tr>
  <tr>
    <th>Telephone:</th>
    <td>555 77 854</td>
  </tr>
  <tr>
    <th>Telephone:</th>
    <td>555 77 855</td>
  </tr>
</table>

My current function:

def is_vertical_table(table):
    # Check if table is vertical and return True.
    pass  # not implemented yet

My initial thought was to check whether all th tags are inside the first tr tag, but that doesn't seem like a perfect solution, as some tags may be inside multiple tbody tags, etc.
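A minimal sketch of that idea with BeautifulSoup, under one assumption that is not stated in the question: a table counts as vertical when every row starts with a single th followed only by td cells. Rows are collected from anywhere under the table, so extra tbody wrappers do not matter, and rows belonging to nested tables are skipped. The sample markup in the demo is a shortened version of the tables above.

```python
from bs4 import BeautifulSoup


def is_vertical_table(table):
    """Return True if every row is one leading <th> followed only by <td> cells."""
    # find_all("tr") searches all descendants, so rows wrapped in one or more
    # <tbody>/<thead> elements are found; rows of nested tables are excluded.
    rows = [tr for tr in table.find_all("tr") if tr.find_parent("table") is table]
    if not rows:
        return False
    for tr in rows:
        # Only the direct cell children of this row.
        cells = tr.find_all(["th", "td"], recursive=False)
        if len(cells) < 2 or cells[0].name != "th":
            return False
        if any(cell.name == "th" for cell in cells[1:]):
            return False
    return True


sample = """
<table>
  <tr><th>Name</th><th>Telephone</th><th>Telephone</th></tr>
  <tr><td>Bill Gates</td><td>555 77 854</td><td>555 77 855</td></tr>
</table>
<table>
  <tr><th>Name:</th><td>Bill Gates</td></tr>
  <tr><th>Telephone:</th><td>555 77 854</td></tr>
  <tr><th>Telephone:</th><td>555 77 855</td></tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")
horizontal, vertical = soup.find_all("table")
print(is_vertical_table(horizontal))  # False
print(is_vertical_table(vertical))    # True
```

Tables that mark up headers differently (for example with td cells in bold, or with both a header row and a header column) would need a different rule.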

CodePudding user response:

You can use pandas.read_html to convert each table to a DataFrame, then use a custom function to compare the number of rows and columns:

html = '''<h2>Horizontal Headings:</h2>
<table style="width:100%">
  <tr>
    <th>Name</th>
    <th>Telephone</th>
    <th>Telephone</th>
  </tr>
  <tr>
    <td>Bill Gates</td>
    <td>555 77 854</td>
    <td>555 77 855</td>
  </tr>
</table>

<h2>Vertical Headings:</h2>
<table style="width:100%">
  <tr>
    <th>Name:</th>
    <td>Bill Gates</td>
  </tr>
  <tr>
    <th>Telephone:</th>
    <td>555 77 854</td>
  </tr>
  <tr>
    <th>Telephone:</th>
    <td>555 77 855</td>
  </tr>
</table>
'''

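from io import StringIO

import pandas as pd

def wide_or_long(df):
    # compare the column count with the row count of the parsed table
    n_rows, n_cols = df.shape
    if n_cols > n_rows:
        return 'wide'
    if n_rows > n_cols:
        return 'long'
    return 'square'

# read_html returns one DataFrame per <table>; recent pandas versions expect
# a file-like object rather than a raw HTML string, hence StringIO
tables = pd.read_html(StringIO(html))

# checking first table
wide_or_long(tables[0])
# wide

# checking second table
wide_or_long(tables[1])
# long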
Alternative function based on the presence of a column header (when read_html finds no header row, the columns are labelled with the default integers 0, 1, ...):

def wide_or_long(df):
    return 'long' if list(df) == list(range(df.shape[1])) else 'wide'
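The two functions can disagree, which may help when choosing between them. The sketch below is only an illustration: by_shape and by_header are hypothetical names for the two versions of wide_or_long above, the sample table is made up, and it assumes pandas plus an HTML parser that read_html can use (lxml, or bs4 with html5lib) are installed. A table with horizontal headings but more rows than columns is labelled 'long' by the shape comparison and 'wide' by the header check.

```python
from io import StringIO

import pandas as pd


def by_shape(df):
    # First version: compare the number of columns with the number of rows.
    n_rows, n_cols = df.shape
    if n_cols > n_rows:
        return 'wide'
    if n_rows > n_cols:
        return 'long'
    return 'square'


def by_header(df):
    # Second version: default integer labels mean no header row was found.
    return 'long' if list(df) == list(range(df.shape[1])) else 'wide'


# Horizontal headings, but three data rows and only two columns (made-up data).
tall_html = """
<table>
  <tr><th>Name</th><th>Telephone</th></tr>
  <tr><td>A</td><td>1</td></tr>
  <tr><td>B</td><td>2</td></tr>
  <tr><td>C</td><td>3</td></tr>
</table>
"""

df = pd.read_html(StringIO(tall_html))[0]
print(df.shape)       # (3, 2)
print(by_shape(df))   # long
print(by_header(df))  # wide
```

If "vertical" is meant as "headings in the first column", the header check matches that meaning more closely, since it does not depend on how many records the table holds.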