I'm scraping a webpage that publishes different documents, and I want to retrieve some information from these documents. At first I hard-coded the scraper to look for the information at a certain XPath, but now I see that this path might change depending on the document. Is there any way to get the text from an element that contains a substring?
Here's an example:
I want to get the company name; the HTML where it appears looks like this:
<div id="fullDocument">
<div >
<div id="docHeader">...</div>
<ul id="docToc">...</ul>
<div >...</div>
<div id="DocumentBody">
<div >...</div>
<div >...</div>
<div >...</div>
<div >...</div>
<div >
<p >...</p>
<div >
<span ></span>
<span ></span>
<div class="txtmark">
"Official name: Company Name"
<br>
"Identification: xxxxxx"
<br>
"Postal code: 00000"
<br>
"City: city"
</div>
</div>
</div>
</div>
</div>
</div>
For this example, I hard-coded the following into my script:
from lxml import etree

class LTED:
    def __init__(self, url, soup):
        if not soup:
            soup = get_soup_from_url(url, "html.parser")
        dom = etree.HTML(str(soup))
        self.organization = self.get_organization(dom)

    def get_organization(self, dom):
        item = dom.xpath("/div[@id='fullDocument']/div/div[3]/div[5]/div/div")[0].text
        return item.split(": ")[1]
This actually works for the example, but as I mentioned, the problem is that the XPath might change depending on the document. For example, instead of "/div[@id='fullDocument']/div/div[3]/div[5]/div/div" it might change to "/div[@id='fullDocument']/div/div[3]/div[6]/div/div" or something similar.
Trying to solve this, I searched the Internet and found the following, but it didn't work for me:
item = soup.find_all("div", string="Official name:")
I expected this to return a list with all elements containing the substring "Official name:", but it gave me an empty list [].
Is there any way to get the element containing the substring so that, independently of the XPath, I can always get the company name and any other information I might need?
CodePudding user response:
I expected this to return a list with all elements containing the substring "Official name:" but it gave me an empty list [].
That is because string="Official name:" requires an exact match of the whole string; to match a substring, pass a compiled regular expression instead (note that text= is a deprecated alias of string=):
import re
soup.find_all(string=re.compile('Official name:'))
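A minimal, self-contained sketch of this approach, assuming the same quoted-string markup as in the question (find_all(string=...) returns the matching text nodes themselves, so the company name can be split out directly):

```python
import re
from bs4 import BeautifulSoup

html = '''
<div id="DocumentBody">
  <div>
    "Official name: Company Name"
    <br>
    "Identification: xxxxxx"
  </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# find_all(string=...) matches text nodes, not tags
matches = soup.find_all(string=re.compile('Official name:'))

# split off the label, then trim whitespace and the literal quotes
organization = matches[0].split(': ')[1].strip().strip('"')
print(organization)  # Company Name
```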
However, why not use an alternative approach (selecting by class) that gives you a structured output?
For a single one:
dict(i.strip('"').split(': ') for i in soup.select_one('#DocumentBody div.txtmark').stripped_strings)
### leads to
{'Official name': 'Company Name',
'Identification': 'xxxxxx',
'Postal code': '00000',
'City': 'city'}
or, for multiple occurrences in your document:
[dict(i.strip('"').split(': ') for i in list(e.stripped_strings)) for e in soup.select('div.txtmark')]
### leads to
[{'Official name': 'Company Name',
'Identification': 'xxxxxx',
'Postal code': '00000',
'City': 'city'},
{'Official name': 'Company Name B',
'Identification': 'xxxxxx',
'Postal code': '00000',
'City': 'city'}]
Example
from bs4 import BeautifulSoup
html='''
<div id="fullDocument">
<div >
<div id="docHeader">...</div>
<ul id="docToc">...</ul>
<div >...</div>
<div id="DocumentBody">
<div >...</div>
<div >...</div>
<div >...</div>
<div >...</div>
<div >
<p >...</p>
<div >
<span ></span>
<span ></span>
<div class="txtmark">
"Official name: Company Name"
<br>
"Identification: xxxxxx"
<br>
"Postal code: 00000"
<br>
"City: city"
</div>
</div>
</div>
</div>
</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
dict(i.strip('"').split(': ') for i in soup.select_one('#DocumentBody div.txtmark').stripped_strings)
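Since the question used lxml, the same substring match can also be expressed in XPath with contains(), which finds the element wherever it sits in the tree; a minimal sketch assuming the same quoted-string markup:

```python
from lxml import etree

html = '''
<div id="fullDocument">
  <div id="DocumentBody">
    <div>
      "Official name: Company Name"
      <br>
      "Identification: xxxxxx"
    </div>
  </div>
</div>
'''

dom = etree.HTML(html)

# // searches the whole tree, so the match no longer depends
# on the exact nesting of the document
nodes = dom.xpath('//div[contains(text(), "Official name:")]')

# .text is the text before the first <br>; split off the label,
# then trim whitespace and the literal quotes
organization = nodes[0].text.split(': ')[1].strip().strip('"')
print(organization)  # Company Name
```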