This question must be a duplicate, but for the life of me I can't find it anywhere.
from bs4 import BeautifulSoup
html = """
<html>
<head>
</head>
<body>
<div id="7471292"></div>
<div id="5235252"></div>
<div href="/some/link/"></div>
<div id="7567327"></div>
<div id="1231312"></div>
<div></div>
<div id="2342424"></div>
</body>
</html>
"""
# Create soup from html (naming the parser explicitly avoids a bs4 warning)
soup = BeautifulSoup(html, "html.parser")
I want the following output:
[<div id="7471292"></div>,
<div id="5235252"></div>,
<div id="7567327"></div>,
<div id="1231312"></div>,
<div id="2342424"></div>]
We can do something like:
soup.find_all("div")
but this will return all divs. If we want to specify an id attribute, we have to supply a concrete value as well, seemingly rendering it useless:
soup.find_all('div', {'id': ""})
CodePudding user response:
You can pass in a lambda function that checks whether the id contains only digits. A regular expression is overkill here.
soup = BeautifulSoup(html, "html.parser")
print(soup.find_all("div", id=lambda x: x is not None and x.isnumeric()))
This outputs:
[<div id="7471292"></div>, <div id="5235252"></div>,
<div id="7567327"></div>, <div id="1231312"></div>, <div id="2342424"></div>]
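To see why the `x is not None` check matters, here is a minimal sketch of the same test written as a named function. It runs without Beautiful Soup and just feeds the function the kinds of values an attribute filter receives: the id string, or None when the tag has no id. The name `has_numeric_id` and the "sidebar" sample are made up for illustration.

```python
def has_numeric_id(value):
    """Return True only for an id value made up entirely of digits.

    Beautiful Soup calls an attribute filter with the attribute's value,
    or with None when the tag has no such attribute.
    """
    return value is not None and value.isnumeric()


# Values as the filter would see them: a numeric id from the question,
# a missing id (None), and a hypothetical non-numeric id.
samples = ["7471292", None, "sidebar"]
results = [has_numeric_id(v) for v in samples]
print(results)  # [True, False, False]
```

The function can be passed in place of the lambda, as in `soup.find_all("div", id=has_numeric_id)`. If any id should match, numeric or not, Beautiful Soup also accepts `id=True`, which keeps every tag that has the attribute at all.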
CodePudding user response:
What you need is a combination of regex and soup:
from bs4 import BeautifulSoup
import re
html = """
<html>
<head>
</head>
<body>
<div id="7471292"></div>
<div id="5235252"></div>
<div href="/some/link/"></div>
<div id="7567327"></div>
<div id="1231312"></div>
<div></div>
<div id="2342424"></div>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")
soup.find_all('div', {'id': re.compile(r"\d+")})
Output
[<div id="7471292"></div>,
<div id="5235252"></div>,
<div id="7567327"></div>,
<div id="1231312"></div>,
<div id="2342424"></div>]
If you are interested in the div tags whose id contains numbers, letters, or a combination of both, try using ([\d\w]+) instead of (\d+).
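One thing worth knowing about the patterns above: Beautiful Soup applies a compiled regex with its `search()` method, so `\d+` matches any id that contains a digit somewhere, not only all-digit ids. The sketch below uses plain `re` on a few sample ids to show the difference; "abc123" and "sidebar" are hypothetical ids that are not in the question's HTML, and the anchored pattern is my own addition.

```python
import re

ids = ["7471292", "abc123", "sidebar"]  # last two are hypothetical ids

any_digit = re.compile(r"\d+")       # search(): at least one digit anywhere
only_digits = re.compile(r"^\d+$")   # anchored: digits from start to end
word_chars = re.compile(r"[\d\w]+")  # \w already covers digits, same as \w+

matches_any = [i for i in ids if any_digit.search(i)]
matches_only = [i for i in ids if only_digits.search(i)]
matches_word = [i for i in ids if word_chars.search(i)]

print(matches_any)   # ['7471292', 'abc123']
print(matches_only)  # ['7471292']
print(matches_word)  # ['7471292', 'abc123', 'sidebar']
```

For the HTML in the question all three patterns give the same result, since every id there is purely numeric. The anchored form only matters if the page also has mixed ids.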