Scraping data from HTML table using Python

Time:09-07

I have an HTML page, shown below, and I am trying to scrape the date from the table at about line 30.

    <HTML>
 <HEAD>
  <TITLE>
HGV|DAF|
  </TITLE>
  <LINK rel="stylesheet" type="text/css" href="report.css" >
 </HEAD>
<BODY BGCOLOR="WHITE" TEXT="BLACK" >
  <TABLE BORDER="0" WIDTH="100%" CELLSPACING="0">
   <TR>
     <TD WIDTH="100%" colspan=" 2"><font face="Courier New" size="2" COLOR="BLACK">Commercial</font></TD>
   </TR>
   <TR>
     <TD WIDTH="90%" ><font face="Courier New" size="2" COLOR="BLACK">DETAILED BRAKE TEST RESULT - FULL TEST</font></TD>
     <TD WIDTH="10%" ALIGN=RIGHT><font face="Courier New" size="2" COLOR="BLACK">483</font></TD>
   </TR>
  </TABLE>
<HR SIZE="1" WIDTH="100%" COLOR="BLACK">
  <TABLE BORDER="0" WIDTH="100%" CELLSPACING="0">
   <TR>
     <TD WIDTH="100%" colspan=" 2"><font face="Courier New" size="2" COLOR="BLACK"><BR></font></TD>
   </TR>
  </TABLE>
  <TABLE BORDER="0" WIDTH="100%" CELLSPACING="0">
   <TR>
     <TD WIDTH="16%" ><font face="Courier New" size="2" COLOR="BLACK">DTp Number</font></TD>
     <TD WIDTH="2%" ><font face="Courier New" size="2" COLOR="BLACK">:</font></TD>
     <TD WIDTH="11%" ><font face="Courier New" size="2" COLOR="BLACK">6136M</font></TD>
     <TD WIDTH="22%" ><font face="Courier New" size="2" COLOR="BLACK">TYPE APPROVED</font></TD>
     <TD WIDTH="8%" ><font face="Courier New" size="2" COLOR="BLACK">Loc.</font></TD>
     <TD WIDTH="2%" ><font face="Courier New" size="2" COLOR="BLACK">:</font></TD>
     <TD WIDTH="39%" colspan=" 4"><font face="Courier New" size="2" COLOR="BLACK">Barnoldswick</font></TD>
   </TR>
   <TR>
     <TD WIDTH="16%" ><font face="Courier New" size="2" COLOR="BLACK">Vehicle Make</font></TD>
     <TD WIDTH="2%" ><font face="Courier New" size="2" COLOR="BLACK">:</font></TD>
     <TD WIDTH="33%" colspan=" 2"><font face="Courier New" size="2" COLOR="BLACK">DAF</font></TD>
     <TD WIDTH="8%" ><font face="Courier New" size="2" COLOR="BLACK">Date</font></TD>
     <TD WIDTH="2%" ><font face="Courier New" size="2" COLOR="BLACK">:</font></TD>
     <TD WIDTH="18%" ><font face="Courier New" size="2" COLOR="BLACK">Wed 03/08/2022</font></TD>
     <TD WIDTH="6%" ><font face="Courier New" size="2" COLOR="BLACK">Time</font></TD>
     <TD WIDTH="2%" ><font face="Courier New" size="2" COLOR="BLACK">:</font></TD>
     <TD WIDTH="13%" ><font face="Courier New" size="2" COLOR="BLACK">14:11</font></TD>
   </TR>
   <TR>
     <TD WIDTH="16%" ><font face="Courier New" size="2" COLOR="BLACK">Vehicle Type</font></TD>
     <TD WIDTH="2%" ><font face="Courier New" size="2" COLOR="BLACK">:</font></TD>
     <TD WIDTH="33%" colspan=" 2"><font face="Courier New" size="2" COLOR="BLACK">4 AXLE RIGID HGV</font></TD>
     <TD WIDTH="8%" ><font face="Courier New" size="2" COLOR="BLACK">GVW</font></TD>
     <TD WIDTH="2%" ><font face="Courier New" size="2" COLOR="BLACK">:</font></TD>
     <TD WIDTH="18%" ><font face="Courier New" size="2" COLOR="BLACK">32000kg</font></TD>
     <TD WIDTH="21%" colspan=" 3"><font face="Courier New" size="2" COLOR="BLACK">&nbsp</font></TD>
   </TR>

I have been trying with Beautiful Soup, but because the cell has no id, class, or other unique attribute, I have been unsuccessful.

If anyone can point me in the right direction, I'd very much appreciate it.

CodePudding user response:

Let's assume that your HTML is in a variable called html. Then:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(*[td_.text for td_ in soup.find_all('td')], sep='\n')

...will print the text part of all TDs.

How you decide which ones are relevant is up to you.
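One way to pick out the relevant cell, given that nothing in the markup identifies it, is to search for the label text instead. The sketch below assumes the layout seen in the question: a label cell ("Date"), then a ":" cell, then the value cell. The helper name value_for_label and the trimmed-down html string are made up for illustration, and 'html.parser' is used only so the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the row in the question (assumed to keep the same cell order).
html = """
<TABLE>
 <TR>
  <TD><font>Vehicle Make</font></TD><TD><font>:</font></TD><TD colspan=" 2"><font>DAF</font></TD>
  <TD><font>Date</font></TD><TD><font>:</font></TD><TD><font>Wed 03/08/2022</font></TD>
  <TD><font>Time</font></TD><TD><font>:</font></TD><TD><font>14:11</font></TD>
 </TR>
</TABLE>
"""


def value_for_label(soup, label):
    """Return the text of the cell two positions after the cell whose text equals `label`."""
    for td in soup.find_all('td'):
        if td.get_text(strip=True) == label:
            # Assumed layout: label cell, ':' cell, value cell.
            following = td.find_next_siblings('td')
            if len(following) >= 2:
                return following[1].get_text(strip=True)
    return None  # label not found, or no value cell after it


soup = BeautifulSoup(html, 'html.parser')
print(value_for_label(soup, 'Date'))  # Wed 03/08/2022
```

Because the lookup keys on the label rather than on a position, it should keep working if cells are added or removed elsewhere in the table, as long as the label/colon/value pattern holds.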

CodePudding user response:

Try the following example:

from bs4 import BeautifulSoup

html = '''
<html>
 <head>
  <title>
   HGV|DAF|
  </title>
  <link href="report.css" rel="stylesheet" type="text/css"/>
 </head>
 <body bgcolor="WHITE" text="BLACK">
  <table border="0" cellspacing="0" width="100%">
   <tr>
    <td colspan=" 2" width="100%">
     <font color="BLACK" face="Courier New" size="2">
      Commercial
     </font>
    </td>
   </tr>
   <tr>
    <td width="90%">
     <font color="BLACK" face="Courier New" size="2">
      DETAILED BRAKE TEST RESULT - FULL TEST
     </font>
    </td>
    <td align="RIGHT" width="10%">
     <font color="BLACK" face="Courier New" size="2">
      483
     </font>
    </td>
   </tr>
  </table>
  <hr color="BLACK" size="1" width="100%"/>
  <table border="0" cellspacing="0" width="100%">
   <tr>
    <td colspan=" 2" width="100%">
     <font color="BLACK" face="Courier New" size="2">
      <br/>
     </font>
    </td>
   </tr>
  </table>
  <table border="0" cellspacing="0" width="100%">
   <tr>
    <td width="16%">
     <font color="BLACK" face="Courier New" size="2">
      DTp Number
     </font>
    </td>
    <td width="2%">
     <font color="BLACK" face="Courier New" size="2">
      :
     </font>
    </td>
    <td width="11%">
     <font color="BLACK" face="Courier New" size="2">
      6136M
     </font>
    </td>
    <td width="22%">
     <font color="BLACK" face="Courier New" size="2">
      TYPE APPROVED
     </font>
    </td>
    <td width="8%">
     <font color="BLACK" face="Courier New" size="2">
      Loc.
     </font>
    </td>
    <td width="2%">
     <font color="BLACK" face="Courier New" size="2">
      :
     </font>
    </td>
    <td colspan=" 4" width="39%">
     <font color="BLACK" face="Courier New" size="2">
      Barnoldswick
     </font>
    </td>
   </tr>
   <tr>
    <td width="16%">
     <font color="BLACK" face="Courier New" size="2">
      Vehicle Make
     </font>
    </td>
    <td width="2%">
     <font color="BLACK" face="Courier New" size="2">
      :
     </font>
    </td>
    <td colspan=" 2" width="33%">
     <font color="BLACK" face="Courier New" size="2">
      DAF
     </font>
    </td>
    <td width="8%">
     <font color="BLACK" face="Courier New" size="2">
      Date
     </font>
    </td>
    <td width="2%">
     <font color="BLACK" face="Courier New" size="2">
      :
     </font>
    </td>
    <td width="18%">
     <font color="BLACK" face="Courier New" size="2">
      Wed 03/08/2022
     </font>
    </td>
    <td width="6%">
     <font color="BLACK" face="Courier New" size="2">
      Time
     </font>
    </td>
    <td width="2%">
     <font color="BLACK" face="Courier New" size="2">
      :
     </font>
    </td>
    <td width="13%">
     <font color="BLACK" face="Courier New" size="2">
      14:11
     </font>
    </td>
   </tr>
   <tr>
    <td width="16%">
     <font color="BLACK" face="Courier New" size="2">
      Vehicle Type
     </font>
    </td>
    <td width="2%">
     <font color="BLACK" face="Courier New" size="2">
      :
     </font>
    </td>
    <td colspan=" 2" width="33%">
     <font color="BLACK" face="Courier New" size="2">
      4 AXLE RIGID HGV
     </font>
    </td>
    <td width="8%">
     <font color="BLACK" face="Courier New" size="2">
      GVW
     </font>
    </td>
    <td width="2%">
     <font color="BLACK" face="Courier New" size="2">
      :
     </font>
    </td>
    <td width="18%">
     <font color="BLACK" face="Courier New" size="2">
      32000kg
     </font>
    </td>
    <td colspan=" 3" width="21%">
     <font color="BLACK" face="Courier New" size="2">
      &amp;nbsp
     </font>
    </td>
   </tr>
  </table>
 </body>
</html>

'''
soup = BeautifulSoup(html, 'lxml')

# The date is in the third table (index 2), in its 13th cell (index 12).
table = soup.select('table')[2]
cells = [td.get_text(strip=True) for td in table.select('tr td')]
date = cells[12]
print(date)

Output:

Wed 03/08/2022
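The scraped value is still a plain string. If a real date object is needed, something like the following could work. It is only a sketch: the helper name parse_report_date is made up, and it assumes the report writes dates day-first (dd/mm/yyyy), which fits the sample since 3 August 2022 was a Wednesday.

```python
from datetime import datetime


def parse_report_date(text):
    """Parse a string such as 'Wed 03/08/2022' into a datetime.date."""
    # Drop the weekday abbreviation so parsing does not depend on the locale.
    _, _, day_month_year = text.strip().partition(' ')
    # Assumption: day comes before month (dd/mm/yyyy).
    return datetime.strptime(day_month_year, '%d/%m/%Y').date()


parsed = parse_report_date('Wed 03/08/2022')
print(parsed.isoformat())  # 2022-08-03
```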