Is there a way to use bs4 to search for multiple attribute types with the same value?
I am scraping meta tags from news articles in order to get information like the title, author, and date published. There is some variation in how this data is structured between sites, and I would like to use the most compact code possible to cover the known possibilities.
For example, the title could be in any of these forms:
<meta content="Title of the article" property="og:title"/>
<meta content="Title of the article" property="title"/>
<meta name="Title of the article" property="og:title"/>
<meta name="Title of the article" property="title"/>
I can do something like this:
try:
    soup.find('meta', {'property' : re.compile('title')})['content']
except KeyError:
    soup.find('meta', {'property' : re.compile('title')})['name']
But it would be nice if I could do something like this:
## No result returned
soup.find('meta', {re.compile('property|name') : re.compile('title')})
## TypeError: unhashable type: 'list'
soup.find('meta', {['property','name'] : re.compile('title')})
Is there something along these lines that would work?
CodePudding user response:
As far as I understand, you want to find more than one tag that holds the same value in the HTML, where the value may sit in different attributes. You can run one search per attribute:
content_metas = soup.find_all("meta", {"content": "Title of the article"})
name_metas = soup.find_all("meta", {"name": "Title of the article"})
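If both attributes should be checked in a single pass, find_all also accepts a function instead of an attribute dictionary. A minimal sketch, assuming markup like the question's samples; the helper name has_title_value and the extra og:description tag are made up for illustration:

```python
from bs4 import BeautifulSoup

html = '''
<meta content="Title of the article" property="og:title"/>
<meta name="Title of the article" property="title"/>
<meta content="Something else" property="og:description"/>
'''
soup = BeautifulSoup(html, "html.parser")

def has_title_value(tag):
    # Hypothetical helper: true for <meta> tags whose content or name
    # attribute equals the wanted value. tag.name is the tag name,
    # tag.get("name") is the attribute called "name".
    wanted = "Title of the article"
    return tag.name == "meta" and wanted in (tag.get("content"), tag.get("name"))

metas = soup.find_all(has_title_value)
print(len(metas))
```

This only helps when the value is known in advance; it does not discover the title by itself.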
CodePudding user response:
The main challenge is that attribute naming can vary, so there should be a check against a list of valid names such as ['name','title','content','...'], which can be outsourced to a separate function.
To select only the <meta> whose property contains title, I go with CSS selectors:
soup.select_one('meta[property*="title"]')
Pass the element to a function that iterates over its attributes and returns the value of the first one that matches one of the possible names:
def get_title(e):
    for a in e.attrs:
        if a in ['name','content']:
            return e.get(a)
title = get_title(soup.select_one('meta[property*="title"]'))
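The attribute check can also be folded into the selector itself, since CSS selector lists (comma-separated) are supported by select_one. A rough sketch, assuming the value is always in either content or name; it also guards against select_one returning None when nothing matches:

```python
from bs4 import BeautifulSoup

html = '<meta name="Title of the article" property="og:title"/>'
soup = BeautifulSoup(html, "html.parser")

# Match a <meta> whose property contains "title" and that carries
# either a content attribute or a name attribute.
tag = soup.select_one(
    'meta[property*="title"][content], meta[property*="title"][name]'
)

# select_one returns None when nothing matches, so check before reading.
title = (tag.get("content") or tag.get("name")) if tag else None
print(title)
```

When a tag has both attributes, this prefers content over name; swap the order if the sites you scrape behave differently.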
The following example shows how the whole thing could also be implemented with a list comprehension. Since a news page will probably contain only one of the combinations, the result would be a list with exactly one element, or an empty list if none of the attributes is present.
from bs4 import BeautifulSoup
html='''
<meta content="Title of the article" property="og:title"/>
<meta content="Title of the article" property="title"/>
<meta name="Title of the article" property="og:title"/>
<meta name="Title of the article" property="title"/>
<meta title="Title of the article" property="title"/>
'''
soup = BeautifulSoup(html, 'html.parser')
[t.get(a) for t in soup.select('meta[property*="title"]') for a in t.attrs if a in ['name','title','content']]
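If only a single value is needed rather than a list, the same comprehension can be turned into a generator and wrapped in next() with a default. This is just a sketch under the same assumptions as above; the name candidates is introduced here for illustration:

```python
from bs4 import BeautifulSoup

html = '<meta title="Title of the article" property="title"/>'
soup = BeautifulSoup(html, "html.parser")

# Attribute names that may hold the title (illustrative, extend as needed).
candidates = ("name", "title", "content")

# Take the first matching attribute value, or None if nothing matches.
title = next(
    (
        t.get(a)
        for t in soup.select('meta[property*="title"]')
        for a in t.attrs
        if a in candidates
    ),
    None,
)
print(title)
```

Using next() with a default avoids indexing into a possibly empty list.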