Beautiful Soup Find href text based on partial attribute value

Time:02-08

I am trying to identify tags in an HTML document based on part of an attribute value.

I'm interested in any 'a' tag under a 'tr' tag whose href attribute starts with or contains:

"AnnoncesImmobilier.asp?rech_cod_pay="

HTML source:

<tr bgcolor="#f1efe2"  valign="middle">
   <td bgcolor="#294a73" height="20"><img alt="" height="1" src="/images/space.gif" width="1"/></td>
   <td>  <a href="AnnoncesImmobilier.asp?rech_cod_pay=TN&amp;rech_cod_vil=10117&amp;rech_cod_loc=1011701" onmouseover="return escape('&lt;b&gt;Gouvernorat&lt;/b&gt; : Tunis&lt;br/&gt;&lt;b&gt;Délégation&lt;/b&gt; : La Marsa&lt;br/&gt;&lt;b&gt;Localité&lt;/b&gt; : Berge Du Lac');">Berge Du Lac</a> </td>
   <td bgcolor="#294a73"><img alt="" height="1" src="/images/space.gif" width="1"/></td>
   <td onmouseover="return escape('&lt;b&gt;Rubrique&lt;/b&gt; : Offres&lt;br/&gt;&lt;b&gt;Nature&lt;/b&gt; : Terrain&lt;br/&gt;&lt;b&gt;Type&lt;/b&gt; : Terrain nu');" style="CURSOR:pointer;">  Terrain</td>

For the row above, the expected result is ad_title = "Berge Du Lac".

In the source HTML, each 'tr' tag with class "Tableau1" contains one ad, with different nested tags (such as 'td' and 'a') for the title, price, description, etc.

Below is my code:

import re
import requests
from bs4 import BeautifulSoup

# The URL to get data from
URL = 'http://www.tunisie-annonce.com/AnnoncesImmobilier.asp'

data = requests.get(URL)

soup = BeautifulSoup(data.content, "html.parser")

# Variable to extract the ads
ads = soup.find_all("tr", {"class":"Tableau1"})

for ad in ads:
    ad_title = ad.find(text=re.compile('AnnoncesImmobilier.asp?rech_cod_pay=')).parent.get_text()
    print(ad_title)

The line ad_title = ad.find(text=re.compile('AnnoncesImmobilier.asp?rech_cod_pay=')).parent.get_text() is the last snippet I tried for retrieving the text, but neither this nor my previous attempts worked.

How can I proceed?

CodePudding user response:

You do not need a regex here; you can pass a function as the href filter:

titles = []
for ad in ads:
    links = ad.find_all('a', href=lambda h: h and "AnnoncesImmobilier.asp?rech_cod_pay=" in h)
    for link in links:
        titles.append(link.get_text())

print(titles)

If you want to get a unique list of titles, use a set:

titles = set()
for ad in ads:
    links = ad.find_all('a', href=lambda h: h and "AnnoncesImmobilier.asp?rech_cod_pay=" in h)
    for link in links:
        titles.add(link.get_text())

In both cases, href=lambda h: h and "AnnoncesImmobilier.asp?rech_cod_pay=" in h makes sure the tag has an href attribute and that its value contains the string AnnoncesImmobilier.asp?rech_cod_pay=.
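For comparison, the attempt in the question most likely failed for two reasons: text= matches the strings between tags rather than attribute values, and an unescaped ? in a regex acts as a quantifier. Below is a minimal sketch on an inline HTML fragment, showing the function filter next to an escaped-regex equivalent. The fragment, the DetailsAnnonce.asp link and the names HTML and NEEDLE are illustrative assumptions, not taken from the live page.

```python
import re
from bs4 import BeautifulSoup

# Illustrative fragment modeled on the question's HTML (an assumption, not the live page).
HTML = """
<table>
  <tr class="Tableau1">
    <td><a href="AnnoncesImmobilier.asp?rech_cod_pay=TN&amp;rech_cod_vil=10117">Berge Du Lac</a></td>
    <td><a href="DetailsAnnonce.asp?cod_ann=1">Terrain</a></td>
    <td><a name="no-href">Anchor without href</a></td>
  </tr>
</table>
"""

NEEDLE = "AnnoncesImmobilier.asp?rech_cod_pay="
soup = BeautifulSoup(HTML, "html.parser")

# Function filter: called with each tag's href value, or None when the attribute is missing.
by_function = [a.get_text(strip=True)
               for a in soup.find_all("a", href=lambda h: h and NEEDLE in h)]

# Regex filter: re.escape() keeps "." and "?" literal, so the pattern matches the raw substring.
by_regex = [a.get_text(strip=True)
            for a in soup.find_all("a", href=re.compile(re.escape(NEEDLE)))]

print(by_function)
print(by_regex)
```

Both lists should contain only the text of the link whose href includes the substring. The function filter is arguably the simpler choice when no pattern matching is needed.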

CodePudding user response:

I'm interested in any 'a' under 'tr' tag as long as it starts or has : "AnnoncesImmobilier.asp?rech_cod_pay=" in the href attribute.

You can make your selection more specific with CSS selectors:

soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])')

To get a list of all the href texts, just iterate over the result set:

[row.a.text for row in soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])')]

Using set() you can filter the list to unique values:

set([row.a.text for row in soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])')])

Output

{'Hammam Lif', 'El Manar 2', 'El Menzah 8', 'Chotrana 1', 'Rades', 'Sousse Corniche', 'Cite De La Sant', 'Sousse', 'Bizerte', 'Ain Zaghouan', 'Hammamet', 'La Soukra', 'Riadh Landlous', 'El Menzah 5', 'Khezama Ouest', 'Montplaisir', 'Sousse Khezama', 'Hergla', 'El Ouerdia', 'Hammam Sousse', 'El Menzah 1', 'Cite Ennasr 2', 'Bab El Khadra'}
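One caveat worth checking on the real page: row.a returns the first 'a' tag in a row, which is only the matching link if that link comes first. A hedged sketch on an illustrative fragment (the Other.asp link, the "Entete" class and the name LINK are assumptions made for the example) that selects the matching anchors directly with a descendant combinator:

```python
from bs4 import BeautifulSoup

# Illustrative fragment (an assumption, not the live page): the first row's
# first link is NOT the one we are after.
HTML = """
<table>
  <tr class="Tableau1">
    <td><a href="Other.asp?id=1">Photo</a></td>
    <td><a href="AnnoncesImmobilier.asp?rech_cod_pay=TN&amp;rech_cod_loc=1011701">Berge Du Lac</a></td>
  </tr>
  <tr class="Tableau1">
    <td><a href="AnnoncesImmobilier.asp?rech_cod_pay=TN&amp;rech_cod_loc=1211413">Sousse Corniche</a></td>
  </tr>
  <tr class="Entete">
    <td><a href="AnnoncesImmobilier.asp?rech_cod_pay=TN">Header link</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
LINK = 'a[href*="AnnoncesImmobilier.asp?rech_cod_pay="]'

# Descendant combinator: select the matching anchors themselves, only inside tr.Tableau1.
titles = [a.get_text(strip=True) for a in soup.select(f"tr.Tableau1 {LINK}")]

# row.a is the first <a> of each selected row, whatever its href is.
first_links = [row.a.get_text(strip=True)
               for row in soup.select(f"tr.Tableau1:has({LINK})")]

print(titles)
print(first_links)
```

If the two lists differ on your data, selecting the anchors directly (or calling row.select_one(LINK) per row) is the safer variant.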

To extract more than just the href text you can do the following:

import pandas as pd

data = []
for row in soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])'):
    d = list(row.stripped_strings)  # all text fragments in the row
    d.append(row.a['href'])         # plus the link target
    data.append(d)
pd.DataFrame(data)

Output

Région          | Nature   | Type        | Texte annonce                 | Prix    | Modifiée   | Link
Sousse Corniche | Location | App. 3 pièc | Magnifique appartement s2 fac | 1 000   | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=12114&rech_cod_loc=1211413
Riadh Landlous  | Location | App. 4 pièc | S3 situé au 1ér étage à riadh | 850     | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=10201&rech_cod_loc=1020135
Khezama Ouest   | Vente    | App. 4 pièc | Magnifique s3 khzema pré      | 250 000 | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=12112&rech_cod_loc=1211209
El Menzah 8     | Location | App. 1 pièc | Studio meublé manzah 8 vv     | 600     | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=10201&rech_cod_loc=1020126
Hergla          | Vente    | App. 3 pièc | Appartement s 2 vue mer       | 300 000 | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=12105&rech_cod_loc=1210502
...
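If named fields are easier to work with than positional lists, the same idea can be sketched with dictionaries and the standard library's URL parsing. The HTML fragment, the records list and the choice of the rech_cod_loc parameter are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlsplit
from bs4 import BeautifulSoup

# Illustrative fragment (an assumption) reusing two links from the output above.
HTML = """
<table>
  <tr class="Tableau1">
    <td><a href="AnnoncesImmobilier.asp?rech_cod_pay=TN&amp;rech_cod_vil=12114&amp;rech_cod_loc=1211413">Sousse Corniche</a></td>
  </tr>
  <tr class="Tableau1">
    <td><a href="AnnoncesImmobilier.asp?rech_cod_pay=TN&amp;rech_cod_vil=12105&amp;rech_cod_loc=1210502">Hergla</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

records = []
for a in soup.select('tr.Tableau1 a[href*="AnnoncesImmobilier.asp?rech_cod_pay="]'):
    href = a["href"]
    # parse_qs maps each query parameter to a list of values.
    params = parse_qs(urlsplit(href).query)
    records.append({
        "title": a.get_text(strip=True),
        "href": href,
        "rech_cod_loc": params.get("rech_cod_loc", [None])[0],
    })

for record in records:
    print(record)
```

A list of dictionaries like this can also be passed to pd.DataFrame, which then uses the keys as column names.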