I'm fairly new to parsing HTML documents and I'm stuck on this problem.
Given an HTML document like this:
<h3>File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMMainThread.h</h3>
<table width="100%">
<h4>Function: ::OMMainThread::destroyThread()</h4>
<table width="100%">
<tr><td align="left">Metric</td><td align="right">CALLS (STCAL)</td><td align="right">v(G) (STCYC)</td><td align="right">GOTO (STGTO)</td><td align="right">RETURN (STM19)</td><td align="right">LEVEL (STMIF)</td><td align="right">PARAM (STPAR)</td><td align="right">PATH (STPTH)</td><td align="right">STMT (STST3)</td></tr>
<tr><td align="left">Values</td><td align="right">1</td><td align="right">1</td><td align="right">0</td><td align="right">0</td><td align="right">0</td><td align="right">0</td><td align="right">1</td><td align="right">1</td></tr>
</table>
<h3>File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMNullValue.h</h3>
<table width="100%">
<h4>Function: ::OMNullValue<p{c::Ping}>::get()</h4>
<table width="100%">
<tr><td align="left">Metric</td><td align="right">CALLS (STCAL)</td><td align="right">v(G) (STCYC)</td><td align="right">GOTO (STGTO)</td><td align="right">RETURN (STM19)</td><td align="right">LEVEL (STMIF)</td><td align="right">PARAM (STPAR)</td><td align="right">PATH (STPTH)</td><td align="right">STMT (STST3)</td></tr>
<tr><td align="left">Values</td><td align="right">1</td><td align="right">1</td><td align="right">0</td><td align="right">1</td><td align="right">0</td><td align="right">0</td><td align="right">1</td><td align="right">2</td></tr>
</table>
<h4>Function: ::OMNullValue<p{c::Ping}>::initNullBlock()</h4>
<table width="100%">
<tr><td align="left">Metric</td><td align="right">CALLS (STCAL)</td><td align="right">v(G) (STCYC)</td><td align="right">GOTO (STGTO)</td><td align="right">RETURN (STM19)</td><td align="right">LEVEL (STMIF)</td><td align="right">PARAM (STPAR)</td><td align="right">PATH (STPTH)</td><td align="right">STMT (STST3)</td></tr>
<tr><td align="left">Values</td><td align="right">0</td><td align="right">2</td><td align="right">0</td><td align="right">0</td><td align="right">1</td><td align="right">0</td><td align="right">2</td><td align="right">5</td></tr>
</table>
<h4>Function: ::OMNullValue<p{c::Pong}>::get()</h4>
<table width="100%">
<tr><td align="left">Metric</td><td align="right">CALLS (STCAL)</td><td align="right">v(G) (STCYC)</td><td align="right">GOTO (STGTO)</td><td align="right">RETURN (STM19)</td><td align="right">LEVEL (STMIF)</td><td align="right">PARAM (STPAR)</td><td align="right">PATH (STPTH)</td><td align="right">STMT (STST3)</td></tr>
<tr><td align="left">Values</td><td align="right">1</td><td align="right">1</td><td align="right">0</td><td align="right">1</td><td align="right">0</td><td align="right">0</td><td align="right">1</td><td align="right">2</td></tr>
</table>
<h4>Function: ::OMNullValue<p{c::Pong}>::initNullBlock()</h4>
<table width="100%">
<tr><td align="left">Metric</td><td align="right">CALLS (STCAL)</td><td align="right">v(G) (STCYC)</td><td align="right">GOTO (STGTO)</td><td align="right">RETURN (STM19)</td><td align="right">LEVEL (STMIF)</td><td align="right">PARAM (STPAR)</td><td align="right">PATH (STPTH)</td><td align="right">STMT (STST3)</td></tr>
<tr><td align="left">Values</td><td align="right">0</td><td align="right">2</td><td align="right">0</td><td align="right">0</td><td align="right">1</td><td align="right">0</td><td align="right">2</td><td align="right">5</td></tr>
</table>
<h3>File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMStaticArray.h</h3>
<table width="100%">
<h4>Function: ::OMStaticArray<p{c::Ping}>::@constructor(,ni)</h4>
<table width="100%">
<tr><td align="left">Metric</td><td align="right">CALLS (STCAL)</td><td align="right">v(G) (STCYC)</td><td align="right">GOTO (STGTO)</td><td align="right">RETURN (STM19)</td><td align="right">LEVEL (STMIF)</td><td align="right">PARAM (STPAR)</td><td align="right">PATH (STPTH)</td><td align="right">STMT (STST3)</td></tr>
<tr><td align="left">Values</td><td align="right">4</td><td align="right">2</td><td align="right">0</td><td align="right">0</td><td align="right">1</td><td align="right">1</td><td align="right">2</td><td align="right">2</td></tr>
</table>
What I need is to build a data structure like this:
<Filename, function (related to that file), STCYC value of that function>
I tried iterating like this:
for files_and_functions in soup.find_all(['h3', 'h4', 'table']):
    for elem in files_and_functions:
        valore = elem.text
and checking for each elem whether it is a file, a function, or an STCYC value, but I can't get it to work. Can anyone extract this information from this terrible HTML? Thank you very much!
CodePudding user response:
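You can try this:
try:
    from BeautifulSoup import BeautifulSoup  # legacy BeautifulSoup 3
except ImportError:
    from bs4 import BeautifulSoup  # BeautifulSoup 4
html = "..."  # the HTML code you've written above
parsed_html = BeautifulSoup(html)
# Note: the snippet in the question has no <div class="container">,
# so this lookup needs adapting to the h3/h4/table elements it does contain.
print(parsed_html.body.find('div', attrs={'class':'container'}).text)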
CodePudding user response:
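If html_doc contains the HTML snippet from your question, you can do:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
# Select only the innermost tables (those that contain no nested table).
# The "metricstable" class is not visible in the snippet as posted;
# drop it from the selector if your tables do not carry it.
for t in soup.select("table.metricstable:not(:has(table))"):
    k = [td.text for td in t.tr.find_all("td")]  # header row: metric names
    v = [td.text for td in t.tr.find_next("tr").find_all("td")]  # values row
    d = dict(zip(k, v))
    filename = t.find_previous("h3").text
    function = t.find_previous("h4").text
    stcyc = d["v(G) (STCYC)"]
    print("{:<50} {:<10} {}".format(function, stcyc, filename))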
Prints:
Function: ::OMMainThread::destroyThread()          1          File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMMainThread.h
Function: ::OMNullValue::get()                     1          File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMNullValue.h
Function: ::OMNullValue::initNullBlock()           2          File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMNullValue.h
Function: ::OMNullValue::get()                     1          File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMNullValue.h
Function: ::OMNullValue::initNullBlock()           2          File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMNullValue.h
Function: ::OMStaticArray::@constructor(,ni)       2          File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMStaticArray.h
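If you want the <Filename, function, STCYC value> structure from the question as data rather than printed lines, one possible sketch is a single linear pass that remembers the most recent h3 and h4 and emits a tuple whenever a "Values" row ends. This version uses the standard library's html.parser instead of BeautifulSoup, so it is a swapped-in technique, not the method from the answer above. The names MetricsCollector, collect_metrics, strip_label and STCYC_COLUMN are made up for illustration, and the sample is a shortened excerpt of the question's HTML with fewer columns. It assumes every table has a "Metric" row followed by a "Values" row, as in the posted snippet.

```python
from html.parser import HTMLParser

STCYC_COLUMN = "v(G) (STCYC)"  # header text of the column we want


def strip_label(text, label):
    """Remove a leading label such as 'File: ' if it is present."""
    return text[len(label):] if text.startswith(label) else text


class MetricsCollector(HTMLParser):
    """Collect (filename, function, stcyc) tuples in one linear pass."""

    def __init__(self):
        super().__init__()
        self.records = []      # finished (filename, function, stcyc) tuples
        self.filename = None   # text of the most recent <h3>
        self.function = None   # text of the most recent <h4>
        self._capture = None   # tag whose text is currently being collected
        self._buffer = []      # text pieces of the element being captured
        self._row = []         # cell texts of the current <tr>
        self._header = []      # cells of the latest "Metric" row

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "h4", "td"):
            self._capture = tag
            self._buffer = []
        elif tag == "tr":
            self._row = []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._capture:
            text = "".join(self._buffer).strip()
            self._capture = None
            if tag == "h3":
                self.filename = strip_label(text, "File: ")
            elif tag == "h4":
                self.function = strip_label(text, "Function: ")
            else:  # a finished <td> cell
                self._row.append(text)
        elif tag == "tr" and self._row:
            self._finish_row()

    def _finish_row(self):
        if self._row[0] == "Metric":
            self._header = self._row
        elif self._row[0] == "Values":
            values = dict(zip(self._header, self._row))
            self.records.append(
                (self.filename, self.function, int(values[STCYC_COLUMN]))
            )
        self._row = []


def collect_metrics(html):
    """Return a list of (filename, function, stcyc) tuples."""
    collector = MetricsCollector()
    collector.feed(html)
    collector.close()
    return collector.records


# Shortened excerpt of the question's HTML (only three columns kept).
sample = """
<h3>File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMMainThread.h</h3>
<table width="100%">
<h4>Function: ::OMMainThread::destroyThread()</h4>
<table width="100%">
<tr><td align="left">Metric</td><td align="right">CALLS (STCAL)</td><td align="right">v(G) (STCYC)</td></tr>
<tr><td align="left">Values</td><td align="right">1</td><td align="right">1</td></tr>
</table>
"""

for record in collect_metrics(sample):
    print(record)
```

Because the state lives in self.filename and self.function, the unclosed outer tables in the document do not matter: each "Values" row is simply attributed to whichever file and function headings were seen last. As the printed output in the answer above shows, template arguments such as <p{c::Ping}> are treated as tags and dropped from the function name, so two instantiations of the same template will produce identical names.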