I am trying to extract a table from HTML using Python's BeautifulSoup:
{"html":"<table width=\"100%\" cellspacing=\"0\" cellpadding=\"0\" class=\"gr_table_b1\">\n <tbody>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\" class=\"gr_table_colm10\">Bond Type<\/td>\n <td valign=\"top\" align=\"right\">\n US Corporate Debentures\n <\/td>\n <\/tr>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\">Debt Type<\/td>\n <td valign=\"top\" align=\"right\">\n \t Senior Unsecured Note\n <\/td>\n <\/tr>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\">Industry Group<\/td>\n <td valign=\"top\" align=\"right\">\n \t Industrial\n <\/td>\n <\/tr>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\">Industry Sub Group<\/td>\n <td valign=\"top\" align=\"right\">\n \t Transportation\n <\/td>\n <\/tr>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\" class=\"gr_table_colm10\">Sub-Product Asset<\/td>\n <td valign=\"top\" align=\"right\">\n CORP\n <\/td>\n <\/tr>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\">Sub-Product Asset Type<\/td>\n <td valign=\"top\" align=\"right\">\n \t Corporate Bond\n <\/td>\n <\/tr>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\">State<\/td>\n <td valign=\"top\" align=\"right\">\n \t —\n <\/td>\n <\/tr>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\">Use of Proceeds<\/td>\n <td valign=\"top\" align=\"right\">\n \t —\n <\/td>\n <\/tr>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\">Security Code<\/td>\n <td valign=\"top\" align=\"right\">\n \t —\n <\/td>\n <\/tr>\n <\/tbody>\n<\/table>\n<div class=\"gr_row_b6 gr_table_title\">Special Characteristics<\/div>\n<div class=\"gr_section_b1\">\n <table width=\"100%\" cellspacing=\"0\" cellpadding=\"0\" class=\"gr_table_b1\">\n <tbody>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\">Medium Term Note<\/td>\n <td valign=\"top\" align=\"right\">\n \t N\n <\/td>\n <\/tr>\n <\/tbody>\n <\/table>\n <\/div>"}
My desired output is:
| Bond Type | US Corporate Debentures |
| Debt Type | Senior Unsecured Note |
| Industry Group | Industrial |
| Industry Sub Group | Transportation |
| Sub-Product Asset | CORP |
| Sub-Product Asset Type | Corporate Bond |
| State | — |
| Use of Proceeds | — |
| Security Code | — |
| | |
CodePudding user response:
You can use the tabulate library to print the formatted output.
from bs4 import BeautifulSoup
from tabulate import tabulate
html_doc = """<table width="100%" cellspacing="0" cellpadding="0" >
<tbody>
<tr >
<td valign="top" >Bond Type</td>
<td valign="top" align="right">
US Corporate Debentures
</td>
</tr>
<tr >
<td valign="top">Debt Type</td>
<td valign="top" align="right">
Senior Unsecured Note
</td>
</tr>
<tr >
<td valign="top">Industry Group</td>
<td valign="top" align="right">
Industrial
</td>
</tr>
<tr >
<td valign="top">Industry Sub Group</td>
<td valign="top" align="right">
Transportation
</td>
</tr>
<tr >
<td valign="top" >Sub-Product Asset</td>
<td valign="top" align="right">
CORP
</td>
</tr>
<tr >
<td valign="top">Sub-Product Asset Type</td>
<td valign="top" align="right">
Corporate Bond
</td>
</tr>
<tr >
<td valign="top">State</td>
<td valign="top" align="right">
—
</td>
</tr>
<tr >
<td valign="top">Use of Proceeds</td>
<td valign="top" align="right">
—
</td>
</tr>
<tr >
<td valign="top">Security Code</td>
<td valign="top" align="right">
—
</td>
</tr>
</tbody>
</table>
<div >Special Characteristics</div>
<div >
<table width="100%" cellspacing="0" cellpadding="0" >
<tbody>
<tr >
<td valign="top">Medium Term Note</td>
<td valign="top" align="right">
N
</td>
</tr>
</tbody>
</table>
</div>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Collect the cell text of every row, one list per <tr>
values = []
for table in soup.find_all("table"):
    for row in table.find_all("tr"):
        values.append([])
        for column in row.find_all("td"):
            values[-1].append(column.text.strip())

print(tabulate(values))
Output
---------------------- -----------------------
Bond Type US Corporate Debentures
Debt Type Senior Unsecured Note
Industry Group Industrial
Industry Sub Group Transportation
Sub-Product Asset CORP
Sub-Product Asset Type Corporate Bond
State —
Use of Proceeds —
Security Code —
Medium Term Note N
---------------------- -----------------------
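The loop above walks every table, so the "Medium Term Note" row from the second table also ends up in the result, while the desired output in the question only lists the first table. A minimal sketch of one way to restrict the extraction to the first table and print it in the pipe-delimited layout, assuming the payload is available as a JSON string (the shortened `raw` payload below is a hypothetical stand-in for the real one):

```python
import json
from bs4 import BeautifulSoup

# Hypothetical, shortened payload in the same shape as the question's JSON.
raw = (
    r'{"html": "<table>\n'
    r'<tr><td>Bond Type<\/td><td>\n US Corporate Debentures\n<\/td><\/tr>\n'
    r'<tr><td>State<\/td><td>\n \u2014\n<\/td><\/tr>\n'
    r'<\/table>\n'
    r'<div><table><tr><td>Medium Term Note<\/td><td> N <\/td><\/tr><\/table><\/div>'
    r'"}'
)

# json.loads decodes the JSON escapes, so "<\/td>" becomes "</td>".
html = json.loads(raw)["html"]

soup = BeautifulSoup(html, "html.parser")
first_table = soup.find("table")  # first <table> only; the one inside the <div> is skipped

# One [label, value] pair per row, with surrounding whitespace stripped.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all("td")]
    for tr in first_table.find_all("tr")
]

for label, value in rows:
    print(f"| {label} | {value} |")
```

Using `soup.find("table")` instead of `soup.find_all("table")` is the only change needed to drop the second table; if the page layout differs, selecting the table by its class may be more robust.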
CodePudding user response:
Assuming you have already located and extracted the JSON, simply use pandas.read_html()
to parse the table:
pd.read_html(json_data['html'])[0]
Example
import pandas as pd
json_data = {"html":"<table width=\"100%\" cellspacing=\"0\" cellpadding=\"0\" class=\"gr_table_b1\">\n <tbody>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\" class=\"gr_table_colm10\">Bond Type<\/td>\n <td valign=\"top\" align=\"right\">\n US Corporate Debentures\n <\/td>\n <\/tr>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\">Debt Type<\/td>\n <td valign=\"top\" align=\"right\">\n \t Senior Unsecured Note\n <\/td>\n <\/tr>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\">Industry Group<\/td>\n <td valign=\"top\" align=\"right\">\n \t Industrial\n <\/td>\n <\/tr>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\">Industry Sub Group<\/td>\n <td valign=\"top\" align=\"right\">\n \t Transportation\n <\/td>\n <\/tr>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\" class=\"gr_table_colm10\">Sub-Product Asset<\/td>\n <td valign=\"top\" align=\"right\">\n CORP\n <\/td>\n <\/tr>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\">Sub-Product Asset Type<\/td>\n <td valign=\"top\" align=\"right\">\n \t Corporate Bond\n <\/td>\n <\/tr>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\">State<\/td>\n <td valign=\"top\" align=\"right\">\n \t —\n <\/td>\n <\/tr>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\">Use of Proceeds<\/td>\n <td valign=\"top\" align=\"right\">\n \t —\n <\/td>\n <\/tr>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\">Security Code<\/td>\n <td valign=\"top\" align=\"right\">\n \t —\n <\/td>\n <\/tr>\n <\/tbody>\n<\/table>\n<div class=\"gr_row_b6 gr_table_title\">Special Characteristics<\/div>\n<div class=\"gr_section_b1\">\n <table width=\"100%\" cellspacing=\"0\" cellpadding=\"0\" class=\"gr_table_b1\">\n <tbody>\n <tr class=\"gr_table_row4\">\n <td valign=\"top\">Medium Term Note<\/td>\n <td valign=\"top\" align=\"right\">\n \t N\n <\/td>\n <\/tr>\n <\/tbody>\n <\/table>\n <\/div>"}
pd.read_html(json_data['html'])[0]
Output
  | 0 | 1 |
---|---|---|
0 | Bond Type | US Corporate Debentures |
1 | Debt Type | Senior Unsecured Note |
2 | Industry Group | Industrial |
3 | Industry Sub Group | Transportation |
4 | Sub-Product Asset | CORP |
5 | Sub-Product Asset Type | Corporate Bond |
6 | State | — |
7 | Use of Proceeds | — |
8 | Security Code | — Special Characteristics Medium Term Note N |
9 | Medium Term Note | N |
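In the output above, row 8 has the text of the second table merged into it. One possible cause, stated here as an assumption rather than something verified, is that the `\/` sequences stay undecoded when the JSON is pasted as a Python literal, so the closing tags are not recognised. A minimal sketch that decodes the payload with `json.loads` first and passes it to `pandas.read_html()` through `StringIO` (recent pandas versions deprecate passing literal HTML strings); the shortened `raw` payload and the column names `Field` and `Value` are hypothetical, and `read_html` needs an HTML parser such as lxml installed:

```python
import json
from io import StringIO

import pandas as pd

# Hypothetical, shortened payload in the same shape as the question's JSON.
raw = (
    r'{"html": "<table>\n'
    r'<tr><td>Bond Type<\/td><td>\n US Corporate Debentures\n<\/td><\/tr>\n'
    r'<tr><td>State<\/td><td>\n \u2014\n<\/td><\/tr>\n'
    r'<\/table>\n'
    r'<div><table><tr><td>Medium Term Note<\/td><td> N <\/td><\/tr><\/table><\/div>'
    r'"}'
)

# Decode the JSON escapes so the HTML has proper closing tags.
html = json.loads(raw)["html"]

# read_html returns one DataFrame per <table> found in the document.
tables = pd.read_html(StringIO(html))

df = tables[0]                     # first table only
df.columns = ["Field", "Value"]    # illustrative column names, not from the source
print(df.to_string(index=False))
```

With the tags decoded, the two tables should come back as separate DataFrames, so `tables[0]` holds only the rows of the first table.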