Group texts of an HTML tag in list of tuples

Time:08-19

I am using BeautifulSoup to extract the text of a tag. When I print the results, I get the following three blocks, each of which belongs to one tag.

<class 'str'>
<class 'str'>
<class 'str'>


<class 'str'>
<class 'str'>
<class 'str'>

<class 'str'>
<class 'str'>

Each tag returns at least two strings. If I append them all to a single list, it looks like this:

[<class 'str'>, 
 <class 'str'>, 
 <class 'str'>, 
 <class 'str'>, 
 <class 'str'>, 
 <class 'str'>, 
 <class 'str'>, 
 <class 'str'>]

Instead, I want each block to become one tuple that keeps only two of its strings, like this:

[(<class 'str'>, <class 'str'>), 
 (<class 'str'>, <class 'str'>), 
 (<class 'str'>, <class 'str'>)]

I tried the following, but could not get from the printed strings to the tuples:

e = []
for element in soup.select('div.ph0.pv2.artdeco-card.mb2 li'):
    names = element.select('div.entity-result__content.entity-result__divider.pt3.pb3.t-12.t-black--light')[0].select('a')
    if len(names) > 1:  # keep only results that contain more than one <a> tag
        for c in names:
            print(c.get_text(separator=" ", strip=True))

Update:

Here, I give a real example. The following is the output of the above code snippet.

LexisNexis
3 people from your school were hired here
82 jobs

AbacusNext
1 person from your school was hired here
24 jobs

Aderant
38 jobs

The output I am trying to produce is something like:

[('LexisNexis', '82'), 
 ('AbacusNext', '24'), 
 ('Aderant', '38')]

The HTML page I am working on can be publicly found here.

To separate numbers from text, I used

import re
regex = re.compile('[^0-9]')
regex.sub('', <not sure what should be here based on my desired output>)
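
One way to fill in the missing argument, as a minimal sketch: assuming the strings of each block have first been collected into one list per block (the `blocks` data and the `to_pair` helper below are hypothetical, with the strings copied from the printed output above), the name is the first string and the job count is in the last one, so the regex only needs to be applied to the last string.

```python
import re

# Hypothetical sample data: one list of strings per block,
# copied from the printed output shown above.
blocks = [
    ["LexisNexis", "3 people from your school were hired here", "82 jobs"],
    ["AbacusNext", "1 person from your school was hired here", "24 jobs"],
    ["Aderant", "38 jobs"],
]

# Matches every character that is not a digit.
non_digits = re.compile(r"[^0-9]")


def to_pair(texts):
    """Return (name, job_count) built from the first and last string of a block."""
    return texts[0], non_digits.sub("", texts[-1])


pairs = [to_pair(texts) for texts in blocks]
print(pairs)
# [('LexisNexis', '82'), ('AbacusNext', '24'), ('Aderant', '38')]
```

Taking the first and last string avoids having to know how many extra strings a block contains, which matters here because some blocks have three strings and others only two.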

CodePudding user response:

Look for a reliable pattern in the markup and select your elements more specifically:

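data = []
# Start from each "N jobs" link, then walk back to the nearest company title.
for e in soup.select('a[href*="jobs/search"]'):
    data.append((
        e.find_previous('span', {'class': 'entity-result__title-text'}).get_text(strip=True),
        e.get_text(strip=True).split()[0]  # "82 jobs" -> "82"
    ))
data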

or

data = []
# Select only the result containers that actually contain a jobs link.
for e in soup.select('.reusable-search__result-container:has(a[href*="jobs/search"])'):
    data.append((
        e.select_one('span.entity-result__title-text').get_text(strip=True),
        e.select_one('a[href*="jobs/search"]').get_text(strip=True).split()[0]  # "82 jobs" -> "82"
    ))
data

Output:

[('LexisNexis', '82'),
 ('AbacusNext', '24'),
 ('Aderant', '38'),
 ('Anaqua', '3'),
 ('Thomson Reuters Elite', '1'),
 ('Litify', '6')]
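
To try the second variant without the original page, here is a self-contained sketch. The HTML below is hypothetical and heavily simplified (the real page has many more wrappers and classes, and "NoJobs Inc" is an invented entry); it only reuses the class names and the `jobs/search` link pattern from the answer above, and it assumes `beautifulsoup4` is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mimics the structure the selectors rely on.
html = """
<ul>
  <li class="reusable-search__result-container">
    <span class="entity-result__title-text"><a href="/company/lexisnexis">LexisNexis</a></span>
    <a href="/school/people">3 people from your school were hired here</a>
    <a href="/jobs/search?c=1">82 jobs</a>
  </li>
  <li class="reusable-search__result-container">
    <span class="entity-result__title-text"><a href="/company/aderant">Aderant</a></span>
    <a href="/jobs/search?c=2">38 jobs</a>
  </li>
  <li class="reusable-search__result-container">
    <span class="entity-result__title-text"><a href="/company/nojobs">NoJobs Inc</a></span>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

data = []
# :has(...) keeps only containers that include a jobs link, so the last <li> is skipped.
for container in soup.select('.reusable-search__result-container:has(a[href*="jobs/search"])'):
    name = container.select_one("span.entity-result__title-text").get_text(strip=True)
    # "82 jobs" -> "82": the count is the first whitespace-separated token.
    count = container.select_one('a[href*="jobs/search"]').get_text(strip=True).split()[0]
    data.append((name, count))

print(data)
# [('LexisNexis', '82'), ('Aderant', '38')]
```

Filtering on the container with `:has(...)` means `select_one` cannot return `None` for the jobs link, which would otherwise raise an `AttributeError` for results that have no job listing.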