How to Extract the Table from the HTML Code using Python BeautifulSoup

Time:10-18

I am trying to extract a table from HTML code using Python BeautifulSoup. The HTML is embedded in this JSON:

{"html":"<table width=\"100%\" cellspacing=\"0\" cellpadding=\"0\" class=\"gr_table_b1\">\n  <tbody>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\" class=\"gr_table_colm10\">Bond Type<\/td>\n      <td valign=\"top\" align=\"right\">\n         US Corporate Debentures\n      <\/td>\n    <\/tr>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\">Debt Type<\/td>\n      <td valign=\"top\" align=\"right\">\n      \t Senior Unsecured Note\n      <\/td>\n    <\/tr>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\">Industry Group<\/td>\n      <td valign=\"top\" align=\"right\">\n      \t  Industrial\n      <\/td>\n    <\/tr>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\">Industry Sub Group<\/td>\n      <td valign=\"top\" align=\"right\">\n      \t Transportation\n      <\/td>\n    <\/tr>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\" class=\"gr_table_colm10\">Sub-Product Asset<\/td>\n      <td valign=\"top\" align=\"right\">\n         CORP\n      <\/td>\n    <\/tr>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\">Sub-Product Asset Type<\/td>\n      <td valign=\"top\" align=\"right\">\n      \t Corporate Bond\n      <\/td>\n    <\/tr>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\">State<\/td>\n      <td valign=\"top\" align=\"right\">\n      \t  &mdash;\n      <\/td>\n    <\/tr>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\">Use of Proceeds<\/td>\n      <td valign=\"top\" align=\"right\">\n      \t &mdash;\n      <\/td>\n    <\/tr>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\">Security Code<\/td>\n      <td valign=\"top\" align=\"right\">\n      \t &mdash;\n      <\/td>\n    <\/tr>\n  <\/tbody>\n<\/table>\n<div class=\"gr_row_b6 gr_table_title\">Special Characteristics<\/div>\n<div class=\"gr_section_b1\">\n  <table width=\"100%\" cellspacing=\"0\" cellpadding=\"0\" class=\"gr_table_b1\">\n   <tbody>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\">Medium Term Note<\/td>\n      <td valign=\"top\" align=\"right\">\n      \t  N\n      <\/td>\n    <\/tr>\n   <\/tbody>\n  <\/table>\n <\/div>"}

My desired output is:

| Bond Type              | US Corporate Debentures  |
| Debt Type              | Senior Unsecured Note    |
| Industry Group         | Industrial               |
| Industry Sub Group     | Transportation           |
| Sub-Product Asset      | CORP                     |
| Sub-Product Asset Type | Corporate Bond           |
| State                  | —                        |
| Use of Proceeds        | —                        |
| Security Code          | —                        |
|                        |                          |

CodePudding user response:

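You can collect the cell text with BeautifulSoup and use the tabulate library to print the formatted output. The loop below visits every table in the document, so the Medium Term Note row from the second table is included as well.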

from bs4 import BeautifulSoup
from tabulate import tabulate

html_doc = """<table width="100%" cellspacing="0" cellpadding="0" >
  <tbody>
    <tr >
      <td valign="top" >Bond Type</td>
      <td valign="top" align="right">
         US Corporate Debentures
      </td>
    </tr>
    <tr >
      <td valign="top">Debt Type</td>
      <td valign="top" align="right">
         Senior Unsecured Note
      </td>
    </tr>
    <tr >
      <td valign="top">Industry Group</td>
      <td valign="top" align="right">
          Industrial
      </td>
    </tr>
    <tr >
      <td valign="top">Industry Sub Group</td>
      <td valign="top" align="right">
         Transportation
      </td>
    </tr>
    <tr >
      <td valign="top" >Sub-Product Asset</td>
      <td valign="top" align="right">
         CORP
      </td>
    </tr>
    <tr >
      <td valign="top">Sub-Product Asset Type</td>
      <td valign="top" align="right">
         Corporate Bond
      </td>
    </tr>
    <tr >
      <td valign="top">State</td>
      <td valign="top" align="right">
          &mdash;
      </td>
    </tr>
    <tr >
      <td valign="top">Use of Proceeds</td>
      <td valign="top" align="right">
         &mdash;
      </td>
    </tr>
    <tr >
      <td valign="top">Security Code</td>
      <td valign="top" align="right">
         &mdash;
      </td>
    </tr>
  </tbody>
</table>
<div >Special Characteristics</div>
<div >
  <table width="100%" cellspacing="0" cellpadding="0" >
   <tbody>
    <tr >
      <td valign="top">Medium Term Note</td>
      <td valign="top" align="right">
          N
      </td>
    </tr>
   </tbody>
  </table>
 </div>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Build one [label, value] list per table row, across every table in the document.
values = []
for table in soup.find_all("table"):
    for row in table.find_all("tr"):
        # .text.strip() removes the newlines and indentation around each cell's text.
        values.append([cell.text.strip() for cell in row.find_all("td")])

print(tabulate(values))

Output

----------------------  -----------------------
Bond Type               US Corporate Debentures
Debt Type               Senior Unsecured Note
Industry Group          Industrial
Industry Sub Group      Transportation
Sub-Product Asset       CORP
Sub-Product Asset Type  Corporate Bond
State                   —
Use of Proceeds         —
Security Code           —
Medium Term Note        N
----------------------  -----------------------
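The question asked for pipe-delimited rows, while tabulate's default layout uses plain columns. As a minimal sketch, assuming `values` is a list of `[label, value]` string pairs like the one built above, the same rows can be padded and joined by hand with the standard library only. The helper name `format_pipe_table` is made up for this illustration and is not part of BeautifulSoup or tabulate.

```python
def format_pipe_table(rows):
    """Render rows of strings as pipe-delimited lines with padded columns."""
    # Pad short rows so every row has the same number of cells.
    n_cols = max(len(row) for row in rows)
    padded = [list(row) + [""] * (n_cols - len(row)) for row in rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in padded) for i in range(n_cols)]
    lines = []
    for row in padded:
        cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Sample data in the same shape as the `values` list (first three rows only).
values = [
    ["Bond Type", "US Corporate Debentures"],
    ["Debt Type", "Senior Unsecured Note"],
    ["Industry Group", "Industrial"],
]
print(format_pipe_table(values))
```

tabulate also accepts a `tablefmt` argument (for example `tablefmt="pipe"`), which may be enough if its exact layout suits you; the hand-rolled version is only useful when you want full control over the padding.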

CodePudding user response:

Assuming you have already located and extracted the JSON, use pandas.read_html() to parse the table. It returns a list of DataFrames, one per table it finds, so [0] selects the first:

pd.read_html(json_data['html'])[0]

Example

import pandas as pd
json_data = {"html":"<table width=\"100%\" cellspacing=\"0\" cellpadding=\"0\" class=\"gr_table_b1\">\n  <tbody>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\" class=\"gr_table_colm10\">Bond Type<\/td>\n      <td valign=\"top\" align=\"right\">\n         US Corporate Debentures\n      <\/td>\n    <\/tr>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\">Debt Type<\/td>\n      <td valign=\"top\" align=\"right\">\n      \t Senior Unsecured Note\n      <\/td>\n    <\/tr>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\">Industry Group<\/td>\n      <td valign=\"top\" align=\"right\">\n      \t  Industrial\n      <\/td>\n    <\/tr>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\">Industry Sub Group<\/td>\n      <td valign=\"top\" align=\"right\">\n      \t Transportation\n      <\/td>\n    <\/tr>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\" class=\"gr_table_colm10\">Sub-Product Asset<\/td>\n      <td valign=\"top\" align=\"right\">\n         CORP\n      <\/td>\n    <\/tr>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\">Sub-Product Asset Type<\/td>\n      <td valign=\"top\" align=\"right\">\n      \t Corporate Bond\n      <\/td>\n    <\/tr>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\">State<\/td>\n      <td valign=\"top\" align=\"right\">\n      \t  &mdash;\n      <\/td>\n    <\/tr>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\">Use of Proceeds<\/td>\n      <td valign=\"top\" align=\"right\">\n      \t &mdash;\n      <\/td>\n    <\/tr>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\">Security Code<\/td>\n      <td valign=\"top\" align=\"right\">\n      \t &mdash;\n      <\/td>\n    <\/tr>\n  <\/tbody>\n<\/table>\n<div class=\"gr_row_b6 gr_table_title\">Special Characteristics<\/div>\n<div class=\"gr_section_b1\">\n  <table width=\"100%\" cellspacing=\"0\" cellpadding=\"0\" class=\"gr_table_b1\">\n   <tbody>\n    <tr class=\"gr_table_row4\">\n      <td valign=\"top\">Medium Term Note<\/td>\n      <td valign=\"top\" align=\"right\">\n      \t  N\n      <\/td>\n    <\/tr>\n   <\/tbody>\n  <\/table>\n <\/div>"}

pd.read_html(json_data['html'])[0]

Output

   0                       1
0  Bond Type               US Corporate Debentures
1  Debt Type               Senior Unsecured Note
2  Industry Group          Industrial
3  Industry Sub Group      Transportation
4  Sub-Product Asset       CORP
5  Sub-Product Asset Type  Corporate Bond
6  State                   —
7  Use of Proceeds         —
8  Security Code           — Special Characteristics Medium Term Note N
9  Medium Term Note        N
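
Note that row 8 of the output above runs on into the text of the second table, and a row 9 appears even though only the first table was requested. One plausible cause (an assumption, not something the thread confirms) is that the JSON text was pasted as a Python dict literal: Python does not treat `\/` as an escape, so the closing tags stay as `<\/td>`, `<\/table>` and so on, and the parser never sees the first table end. Decoding the raw text with the standard `json` module turns `\/` into `/`. The sketch below uses a shortened stand-in payload, not the full response.

```python
import json

# Shortened stand-in for the real payload (assumption: the full response has
# the same shape). The r-prefix keeps every backslash exactly as it appears
# in the raw JSON text.
raw_json = r'{"html":"<table>\n<tr>\n<td>Security Code<\/td>\n<td>\n\t &mdash;\n<\/td>\n<\/tr>\n<\/table>"}'

# json.loads applies the JSON escape rules: \n becomes a newline,
# \t a tab, and \/ a plain slash.
html = json.loads(raw_json)["html"]

print("<\\/td>" in raw_json)  # the raw text still has escaped closing tags
print("</td>" in html)        # the decoded HTML has real closing tags
print("\\" in html)           # no stray backslashes remain after decoding
```

The decoded string can then be handed to `pd.read_html()` as in the answer. Recent pandas versions warn when given a literal HTML string, so wrapping it first, as in `pd.read_html(io.StringIO(html))[0]`, is the safer form.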