I have an HTML page, shown below, and I am trying to scrape the date from the table at about line 30.
<HTML>
<HEAD>
<TITLE>
HGV|DAF|
</TITLE>
<LINK rel="stylesheet" type="text/css" href="report.css" >
</HEAD>
<BODY BGCOLOR="WHITE" TEXT="BLACK" >
<TABLE BORDER="0" WIDTH="100%" CELLSPACING="0">
<TR>
<TD WIDTH="100%" colspan=" 2"><font face="Courier New" size="2" COLOR="BLACK">Commercial</font></TD>
</TR>
<TR>
<TD WIDTH="90%" ><font face="Courier New" size="2" COLOR="BLACK">DETAILED BRAKE TEST RESULT - FULL TEST</font></TD>
<TD WIDTH="10%" ALIGN=RIGHT><font face="Courier New" size="2" COLOR="BLACK">483</font></TD>
</TR>
</TABLE>
<HR SIZE="1" WIDTH="100%" COLOR="BLACK">
<TABLE BORDER="0" WIDTH="100%" CELLSPACING="0">
<TR>
<TD WIDTH="100%" colspan=" 2"><font face="Courier New" size="2" COLOR="BLACK"><BR></font></TD>
</TR>
</TABLE>
<TABLE BORDER="0" WIDTH="100%" CELLSPACING="0">
<TR>
<TD WIDTH="16%" ><font face="Courier New" size="2" COLOR="BLACK">DTp Number</font></TD>
<TD WIDTH="2%" ><font face="Courier New" size="2" COLOR="BLACK">:</font></TD>
<TD WIDTH="11%" ><font face="Courier New" size="2" COLOR="BLACK">6136M</font></TD>
<TD WIDTH="22%" ><font face="Courier New" size="2" COLOR="BLACK">TYPE APPROVED</font></TD>
<TD WIDTH="8%" ><font face="Courier New" size="2" COLOR="BLACK">Loc.</font></TD>
<TD WIDTH="2%" ><font face="Courier New" size="2" COLOR="BLACK">:</font></TD>
<TD WIDTH="39%" colspan=" 4"><font face="Courier New" size="2" COLOR="BLACK">Barnoldswick</font></TD>
</TR>
<TR>
<TD WIDTH="16%" ><font face="Courier New" size="2" COLOR="BLACK">Vehicle Make</font></TD>
<TD WIDTH="2%" ><font face="Courier New" size="2" COLOR="BLACK">:</font></TD>
<TD WIDTH="33%" colspan=" 2"><font face="Courier New" size="2" COLOR="BLACK">DAF</font></TD>
<TD WIDTH="8%" ><font face="Courier New" size="2" COLOR="BLACK">Date</font></TD>
<TD WIDTH="2%" ><font face="Courier New" size="2" COLOR="BLACK">:</font></TD>
<TD WIDTH="18%" ><font face="Courier New" size="2" COLOR="BLACK">Wed 03/08/2022</font></TD>
<TD WIDTH="6%" ><font face="Courier New" size="2" COLOR="BLACK">Time</font></TD>
<TD WIDTH="2%" ><font face="Courier New" size="2" COLOR="BLACK">:</font></TD>
<TD WIDTH="13%" ><font face="Courier New" size="2" COLOR="BLACK">14:11</font></TD>
</TR>
<TR>
<TD WIDTH="16%" ><font face="Courier New" size="2" COLOR="BLACK">Vehicle Type</font></TD>
<TD WIDTH="2%" ><font face="Courier New" size="2" COLOR="BLACK">:</font></TD>
<TD WIDTH="33%" colspan=" 2"><font face="Courier New" size="2" COLOR="BLACK">4 AXLE RIGID HGV</font></TD>
<TD WIDTH="8%" ><font face="Courier New" size="2" COLOR="BLACK">GVW</font></TD>
<TD WIDTH="2%" ><font face="Courier New" size="2" COLOR="BLACK">:</font></TD>
<TD WIDTH="18%" ><font face="Courier New" size="2" COLOR="BLACK">32000kg</font></TD>
<TD WIDTH="21%" colspan=" 3"><font face="Courier New" size="2" COLOR="BLACK"> </font></TD>
</TR>
I have been trying with Beautiful Soup, but because the cell has no unique tag or attribute to select on, I have been unsuccessful.
If anyone can point me in the right direction, I'd very much appreciate it.
CodePudding user response:
Let's assume that your HTML is in a variable called html. Then:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(*[td_.text for td_ in soup.find_all('td')], sep='\n')
...will print the text of every TD, one per line.
How you decide which ones are relevant is up to you.
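One way to decide, without relying on a cell's position, is to look for the label cell ("Date") and read the value two cells later, skipping the ":" separator. This is only a sketch: the helper name value_for_label is made up for illustration, the HTML is a cut-down copy of the row from the question, and it uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Cut-down copy of the row from the question that holds the date.
html = """
<TABLE>
<TR>
<TD><font>Vehicle Make</font></TD>
<TD><font>:</font></TD>
<TD colspan=" 2"><font>DAF</font></TD>
<TD><font>Date</font></TD>
<TD><font>:</font></TD>
<TD><font>Wed 03/08/2022</font></TD>
<TD><font>Time</font></TD>
<TD><font>:</font></TD>
<TD><font>14:11</font></TD>
</TR>
</TABLE>
"""


def value_for_label(soup, label):
    """Hypothetical helper: return the text of the cell two cells after
    the one whose text equals `label`, or None if there is no such cell."""
    for td in soup.find_all("td"):
        if td.get_text(strip=True) == label:
            # The first following cell is the ":" separator,
            # the second one holds the value.
            following = td.find_next_siblings("td", limit=2)
            if len(following) == 2:
                return following[1].get_text(strip=True)
    return None


soup = BeautifulSoup(html, "html.parser")
date_text = value_for_label(soup, "Date")
print(date_text)  # Wed 03/08/2022
```

The same helper should work for the other labelled values in that layout, for example value_for_label(soup, "Time"), as long as each label is followed by a ":" cell and then the value.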
CodePudding user response:
Try the following example:
from bs4 import BeautifulSoup
html = '''
<html>
<head>
<title>
HGV|DAF|
</title>
<link href="report.css" rel="stylesheet" type="text/css"/>
</head>
<body bgcolor="WHITE" text="BLACK">
<table border="0" cellspacing="0" width="100%">
<tr>
<td colspan=" 2" width="100%">
<font color="BLACK" face="Courier New" size="2">
Commercial
</font>
</td>
</tr>
<tr>
<td width="90%">
<font color="BLACK" face="Courier New" size="2">
DETAILED BRAKE TEST RESULT - FULL TEST
</font>
</td>
<td align="RIGHT" width="10%">
<font color="BLACK" face="Courier New" size="2">
483
</font>
</td>
</tr>
</table>
<hr color="BLACK" size="1" width="100%"/>
<table border="0" cellspacing="0" width="100%">
<tr>
<td colspan=" 2" width="100%">
<font color="BLACK" face="Courier New" size="2">
<br/>
</font>
</td>
</tr>
</table>
<table border="0" cellspacing="0" width="100%">
<tr>
<td width="16%">
<font color="BLACK" face="Courier New" size="2">
DTp Number
</font>
</td>
<td width="2%">
<font color="BLACK" face="Courier New" size="2">
:
</font>
</td>
<td width="11%">
<font color="BLACK" face="Courier New" size="2">
6136M
</font>
</td>
<td width="22%">
<font color="BLACK" face="Courier New" size="2">
TYPE APPROVED
</font>
</td>
<td width="8%">
<font color="BLACK" face="Courier New" size="2">
Loc.
</font>
</td>
<td width="2%">
<font color="BLACK" face="Courier New" size="2">
:
</font>
</td>
<td colspan=" 4" width="39%">
<font color="BLACK" face="Courier New" size="2">
Barnoldswick
</font>
</td>
</tr>
<tr>
<td width="16%">
<font color="BLACK" face="Courier New" size="2">
Vehicle Make
</font>
</td>
<td width="2%">
<font color="BLACK" face="Courier New" size="2">
:
</font>
</td>
<td colspan=" 2" width="33%">
<font color="BLACK" face="Courier New" size="2">
DAF
</font>
</td>
<td width="8%">
<font color="BLACK" face="Courier New" size="2">
Date
</font>
</td>
<td width="2%">
<font color="BLACK" face="Courier New" size="2">
:
</font>
</td>
<td width="18%">
<font color="BLACK" face="Courier New" size="2">
Wed 03/08/2022
</font>
</td>
<td width="6%">
<font color="BLACK" face="Courier New" size="2">
Time
</font>
</td>
<td width="2%">
<font color="BLACK" face="Courier New" size="2">
:
</font>
</td>
<td width="13%">
<font color="BLACK" face="Courier New" size="2">
14:11
</font>
</td>
</tr>
<tr>
<td width="16%">
<font color="BLACK" face="Courier New" size="2">
Vehicle Type
</font>
</td>
<td width="2%">
<font color="BLACK" face="Courier New" size="2">
:
</font>
</td>
<td colspan=" 2" width="33%">
<font color="BLACK" face="Courier New" size="2">
4 AXLE RIGID HGV
</font>
</td>
<td width="8%">
<font color="BLACK" face="Courier New" size="2">
GVW
</font>
</td>
<td width="2%">
<font color="BLACK" face="Courier New" size="2">
:
</font>
</td>
<td width="18%">
<font color="BLACK" face="Courier New" size="2">
32000kg
</font>
</td>
<td colspan=" 3" width="21%">
<font color="BLACK" face="Courier New" size="2">
&nbsp;
</font>
</td>
</tr>
</table>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
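# The third table (index 2) holds the vehicle details.
# Its cell at index 12 is the value that follows the "Date" label.
for table in soup.select('table')[2:4]:
    date = [td.get_text(strip=True) for td in table.select('tr td')][12]
    print(date)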
Output:
Wed 03/08/2022
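A fixed index such as 12 breaks as soon as a cell is added or removed. If that is a concern, one possible alternative is to match on the shape of the text instead and then parse it into a date. The sketch below makes some assumptions: the date is always a three-letter weekday followed by dd/mm/yyyy (3 August 2022 was a Wednesday, which fits the day-first reading), the HTML is a cut-down copy of the cells from the question, and the names DATE_RE, date_text and parsed are only illustrative.

```python
import re
from datetime import datetime

from bs4 import BeautifulSoup

# Cut-down copy of the cells around the date in the question's HTML.
html = """
<TABLE><TR>
<TD><font>Date</font></TD>
<TD><font>:</font></TD>
<TD><font>Wed 03/08/2022</font></TD>
<TD><font>Time</font></TD>
<TD><font>:</font></TD>
<TD><font>14:11</font></TD>
</TR></TABLE>
"""

# Assumed shape of the value: three-letter weekday, then dd/mm/yyyy.
DATE_RE = re.compile(r"[A-Z][a-z]{2} \d{2}/\d{2}/\d{4}")

soup = BeautifulSoup(html, "html.parser")

# Take the first cell whose whole text looks like a date.
date_text = None
for td in soup.find_all("td"):
    text = td.get_text(strip=True)
    if DATE_RE.fullmatch(text):
        date_text = text
        break

print(date_text)  # Wed 03/08/2022

parsed = None
if date_text is not None:
    # Drop the weekday name so that parsing does not depend on the locale.
    day_month_year = date_text.split(maxsplit=1)[1]
    parsed = datetime.strptime(day_month_year, "%d/%m/%Y").date()

print(parsed)  # 2022-08-03
```

If a report can contain more than one date-shaped cell, this picks the first one, so combining it with a check on the preceding "Date" label may be safer.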