I am using BeautifulSoup to extract the text of a tag. When I print it, I get these three blocks, each of which belongs to one tag.
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
Each tag returns at least two strings, which, if I append them to a list, look like this:
[<class 'str'>,
<class 'str'>,
<class 'str'>,
<class 'str'>,
<class 'str'>,
<class 'str'>,
<class 'str'>,
<class 'str'>]
Instead, I want each block to become a tuple that keeps only two of its strings, like this:
[(<class 'str'>, <class 'str'>),
(<class 'str'>, <class 'str'>),
(<class 'str'>, <class 'str'>)]
I tried the following, but I struggled:
e = []
for element in soup.select('div.ph0.pv2.artdeco-card.mb2 li'):
    names = element.select('div.entity-result__content.entity-result__divider.pt3.pb3.t-12.t-black--light')[0].select('a')
    if len(names) > 1:  # filter those <a></a> tags with extra information
        for c in names:
            print(c.get_text(separator=" ", strip=True))
Update:
Here is a real example. The following is the output of the code snippet above.
LexisNexis
3 people from your school were hired here
82 jobs
AbacusNext
1 person from your school was hired here
24 jobs
Aderant
38 jobs
The output I am trying to produce is something like:
[('LexisNexis', '82'),
('AbacusNext', '24'),
('Aderant', '38')]
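To make the goal concrete, here is a minimal sketch that works on the printed strings alone, assuming each tag's strings have already been collected into their own list. The names `blocks` and `block_to_pair` are hypothetical and only serve the illustration; the sample strings are the ones printed above.

```python
# Hypothetical input: one list of strings per tag, as printed above.
blocks = [
    ['LexisNexis', '3 people from your school were hired here', '82 jobs'],
    ['AbacusNext', '1 person from your school was hired here', '24 jobs'],
    ['Aderant', '38 jobs'],
]

def block_to_pair(strings):
    """Return (name, job_count) for one block, or None if no job line exists."""
    name = strings[0]  # assumption: the first string is the company name
    for s in strings[1:]:
        words = s.split()
        # Keep only the line shaped like "<number> jobs" (or "<number> job").
        if len(words) == 2 and words[0].isdigit() and words[1] in ('job', 'jobs'):
            return (name, words[0])
    return None

pairs = [p for p in map(block_to_pair, blocks) if p is not None]
print(pairs)
# [('LexisNexis', '82'), ('AbacusNext', '24'), ('Aderant', '38')]
```

This ignores the "hired here" line entirely, so blocks with two or three strings are handled the same way.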
The HTML page I am working on can be publicly found here.
To separate the numbers from the text, I used:
import re
regex = re.compile('[^0-9]')
regex.sub('', <not sure what should be here based on my desired output>)
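As a small illustration of what that pattern does, the sketch below applies it to one of the strings printed earlier. The helper name `digits_only` is hypothetical; the pattern itself is the one shown above.

```python
import re

# Same pattern as above: matches every character that is not a digit.
non_digits = re.compile(r'[^0-9]')

def digits_only(text):
    """Remove every non-digit character from text."""
    return non_digits.sub('', text)

print(digits_only('82 jobs'))  # 82
```

Note that this removes everything except digits, so it only gives a clean count when it is applied to the "jobs" string and not to the whole block.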
CodePudding user response:
Try to find a good pattern and select your elements more specifically:
data = []
for e in soup.select('a[href*="jobs/search"]'):
    data.append((
        e.find_previous('span', {'class': 'entity-result__title-text'}).get_text(strip=True),
        e.get_text(strip=True).split()[0]
    ))
data
or
data = []
for e in soup.select('.reusable-search__result-container:has(a[href*="jobs/search"])'):
    data.append((
        e.select_one('span.entity-result__title-text').get_text(strip=True),
        e.select_one('a[href*="jobs/search"]').get_text(strip=True).split()[0]
    ))
data
Output:
[('LexisNexis', '82'),
('AbacusNext', '24'),
('Aderant', '38'),
('Anaqua', '3'),
('Thomson Reuters Elite', '1'),
('Litify', '6')]
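For anyone who wants to try the first variant without the original page, here is a self-contained sketch. The HTML below is a made-up, heavily simplified stand-in for the real markup (the `href` values and nesting are assumptions); only the class names and the `jobs/search` link pattern are taken from the selectors above.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page is more deeply nested.
html = """
<ul>
  <li class="reusable-search__result-container">
    <span class="entity-result__title-text"><a href="/company/lexisnexis/">LexisNexis</a></span>
    <a href="/search/results/people/">3 people from your school were hired here</a>
    <a href="/jobs/search/?f_C=1">82 jobs</a>
  </li>
  <li class="reusable-search__result-container">
    <span class="entity-result__title-text"><a href="/company/aderant/">Aderant</a></span>
    <a href="/jobs/search/?f_C=2">38 jobs</a>
  </li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

data = []
# Start from each "N jobs" link, then walk backwards to the nearest title span.
for e in soup.select('a[href*="jobs/search"]'):
    title = e.find_previous('span', {'class': 'entity-result__title-text'})
    data.append((
        title.get_text(strip=True),
        e.get_text(strip=True).split()[0],  # first word of "82 jobs" is the count
    ))

print(data)
# [('LexisNexis', '82'), ('Aderant', '38')]
```

Anchoring on the jobs link means the "hired here" link never has to be filtered out, which is why no length check on the `<a>` tags is needed.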