How to extract string value in html with parsing in python

Time:12-18

I am trying to get the string value of each link (for example, Pennsylvania) from HTML like this:

<li class="facetbox-shownrow">
    <a href="/bill/116th-congress/house-bill/9043/cosponsors?r=1&amp;s=1&amp;q={"search":["H.R.9043","H.R.9043"],"cosponsor-state":"Pennsylvania"}" title="include this search constraint" id="facetItemcosponsor-statePennsylvania">
        Pennsylvania        <span id="facetItemcosponsor-statePennsylvaniacount" >[1]</span>    </a>
</li>

Because the a tag also has title and id attributes, I am not sure how to get just the text. I get a null result when I display my array. Here is my code:

for link in links_array:
    main_url_link = base_url_link + link
    html_page_link = requests.get(main_url_link)
    soup_link = BeautifulSoup(html_page_link.text, 'html.parser')
    allData_link = soup_link.findAll('li', {'class': 'facetbox-shownrow'})

    distric = [y.text_content() for y in allData_link]
    district_array.append(distric)

district_array

Answer:

Use .stripped_strings to get the non-empty text nodes of each selected element with surrounding whitespace removed, then pick or slice the result. Here the first string is the state name, Pennsylvania:

[list(x.stripped_strings)[0] for x in soup.find_all('li',{'class':'facetbox-shownrow'})]
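
To see why the first stripped string is picked rather than the element's full text, here is a minimal sketch on a cut-down copy of the markup. The shortened href is a placeholder, and the class attribute on the li is assumed to be facetbox-shownrow to match the selector above.

```python
from bs4 import BeautifulSoup

# Cut-down markup; "/placeholder" stands in for the real, much longer href.
html = """
<li class="facetbox-shownrow">
    <a href="/placeholder" id="facetItemcosponsor-statePennsylvania">
        Pennsylvania        <span id="facetItemcosponsor-statePennsylvaniacount">[1]</span>    </a>
</li>
"""
soup = BeautifulSoup(html, "html.parser")
li = soup.find("li", {"class": "facetbox-shownrow"})

# get_text() joins every text node, so the count from the <span> is included.
full_text = li.get_text(" ", strip=True)

# stripped_strings yields each non-empty text node separately, whitespace removed.
parts = list(li.stripped_strings)

print(full_text)  # Pennsylvania [1]
print(parts)      # ['Pennsylvania', '[1]']
print(parts[0])   # Pennsylvania
```

The title and id attributes play no part here: both approaches read text nodes only, never attribute values.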

Note: in new code, use find_all(). findAll() still works, but it is the legacy spelling kept from BeautifulSoup 3.

To get the href:

[x.a['href'] for x in soup.find_all('li',{'class':'facetbox-shownrow'})]
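
If both the name and the link are needed, the two comprehensions can be combined. The following is only a sketch: the markup, the /example/... hrefs, and the state_links name are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with shortened hrefs, same structure as in the question.
html = """
<li class="facetbox-shownrow">
    <a href="/example/pennsylvania">Pennsylvania <span>[1]</span></a>
</li>
<li class="facetbox-shownrow">
    <a href="/example/california">California <span>[2]</span></a>
</li>
"""
soup = BeautifulSoup(html, "html.parser")

state_links = {}
for li in soup.find_all("li", {"class": "facetbox-shownrow"}):
    anchor = li.a
    if anchor is None:
        continue  # skip rows that contain no link
    # First non-empty text node of the link, or None if there is none.
    name = next(anchor.stripped_strings, None)
    if name is not None:
        # .get() returns None instead of raising if href is missing.
        state_links[name] = anchor.get("href")

print(state_links)
# {'Pennsylvania': '/example/pennsylvania', 'California': '/example/california'}
```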

Example

With multiple li tags:

from bs4 import BeautifulSoup

html = """
<li class="facetbox-shownrow">
    <a href="/bill/116th-congress/house-bill/9043/cosponsors?r=1&amp;s=1&amp;q={"search":["H.R.9043","H.R.9043"],"cosponsor-state":"Pennsylvania"}" title="include this search constraint" id="facetItemcosponsor-statePennsylvania">
        Pennsylvania        <span id="facetItemcosponsor-statePennsylvaniacount" >[1]</span>    </a>
</li>
<li class="facetbox-shownrow">
    <a href="/bill/116th-congress/house-bill/9043/cosponsors?r=1&amp;s=1&amp;q={"search":["H.R.9043","H.R.9043"],"cosponsor-state":"Pennsylvania"}" title="include this search constraint" id="facetItemcosponsor-statePennsylvania">
        Main        <span id="facetItemcosponsor-statePennsylvaniacount" >[1]</span>    </a>
</li>
<li class="facetbox-shownrow">
    <a href="/bill/116th-congress/house-bill/9043/cosponsors?r=1&amp;s=1&amp;q={"search":["H.R.9043","H.R.9043"],"cosponsor-state":"Pennsylvania"}" title="include this search constraint" id="facetItemcosponsor-statePennsylvania">
        California        <span id="facetItemcosponsor-statePennsylvaniacount" >[1]</span>    </a>
</li>
"""
soup = BeautifulSoup(html, "html.parser")

# First stripped string of each matching li is the state name
print([list(x.stripped_strings)[0] for x in soup.find_all('li', {'class': 'facetbox-shownrow'})])

Output

['Pennsylvania', 'Main', 'California']
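
Applied to the loop in the question, the same idea could look like the sketch below. Note that text_content() is an lxml method and is not available on BeautifulSoup tags, so it is replaced here. The helper name state_names and the inline pages are assumptions for illustration. In the real loop each page would be html_page_link.text from requests.get().

```python
from bs4 import BeautifulSoup

def state_names(html):
    """Return the first text node of every li.facetbox-shownrow in html."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for li in soup.find_all("li", {"class": "facetbox-shownrow"}):
        first = next(li.stripped_strings, None)  # None for an empty <li>
        if first is not None:
            names.append(first)
    return names

# Stand-ins for the text of each fetched page (hypothetical content).
pages = [
    '<ul><li class="facetbox-shownrow"><a href="/a">Pennsylvania <span>[1]</span></a></li></ul>',
    '<ul><li class="facetbox-shownrow"><a href="/b">California <span>[3]</span></a></li>'
    '<li class="facetbox-shownrow"><a href="/c">Texas <span>[2]</span></a></li></ul>',
    '<p>no facet list on this page</p>',
]

district_array = [state_names(page) for page in pages]
print(district_array)  # [['Pennsylvania'], ['California', 'Texas'], []]
```

Using next(..., None) instead of indexing with [0] avoids an IndexError when a matching li has no text at all.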