I am trying to identify tags in an HTML document based on part of an attribute value.
I'm interested in any 'a' tag under a 'tr' tag, as long as its href attribute starts with or contains:
"AnnoncesImmobilier.asp?rech_cod_pay="
HTML source:
<tr bgcolor="#f1efe2" valign="middle">
<td bgcolor="#294a73" height="20"><img alt="" height="1" src="/images/space.gif" width="1"/></td>
<td> <a href="AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=10117&rech_cod_loc=1011701" onmouseover="return escape('<b>Gouvernorat</b> : Tunis<br/><b>Délégation</b> : La Marsa<br/><b>Localité</b> : Berge Du Lac');">Berge Du Lac</a> </td>
<td bgcolor="#294a73"><img alt="" height="1" src="/images/space.gif" width="1"/></td>
<td onmouseover="return escape('<b>Rubrique</b> : Offres<br/><b>Nature</b> : Terrain<br/><b>Type</b> : Terrain nu');" style="CURSOR:pointer;"> Terrain</td>
For the row above, the expected result is ad_title = "Berge Du Lac".
In the source HTML, each "tr" tag with class "Tableau1" contains one ad, with different "td" and "a" tags for the title, price, description, etc.
Below is my code :
import re
import requests
from bs4 import BeautifulSoup
# The URL to get data from
URL = 'http://www.tunisie-annonce.com/AnnoncesImmobilier.asp'
data = requests.get(URL)
soup = BeautifulSoup(data.content, "html.parser")
# Variable to extract the ads
ads = soup.find_all("tr", {"class":"Tableau1"})
for ad in ads:
    ad_title = ads.find(text=re.compile('AnnoncesImmobilier.asp?rech_cod_pay=')).parent.get_text())
    print(ad_title)
The line ad_title = ads.find(text=re.compile('AnnoncesImmobilier.asp?rech_cod_pay=')).parent.get_text()) is the last snippet I tried in order to retrieve the text, but neither this nor my previous attempts worked.
How can I proceed?
CodePudding user response:
You do not need a regex here; you can filter on the href attribute with a function:
titles = []
for ad in ads:
    links = ad.find_all('a', href=lambda h: h and "AnnoncesImmobilier.asp?rech_cod_pay=" in h)
    for link in links:
        titles.append(link.get_text())
print(titles)
If you want to get a unique list of titles, use a set:
titles = set()
for ad in ads:
    links = ad.find_all('a', href=lambda h: h and "AnnoncesImmobilier.asp?rech_cod_pay=" in h)
    for link in links:
        titles.add(link.get_text())
In both cases, href=lambda h: h and "AnnoncesImmobilier.asp?rech_cod_pay=" in h makes sure that the tag has an href attribute and that its value contains the string AnnoncesImmobilier.asp?rech_cod_pay=.
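As a minimal, self-contained sketch of that filter: the HTML below is a shortened, made-up stand-in for the real page, and the names needle and regex_titles are only illustrative. It also shows a regex variant, in case one is preferred. The attempt in the question passed the pattern to text=, which matches text nodes rather than attributes, and left the "?" and "." unescaped, so they acted as regex metacharacters.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample imitating the page structure (an assumption, not the real markup).
html = """
<table>
  <tr class="Tableau1">
    <td><a href="AnnoncesImmobilier.asp?rech_cod_pay=TN">Berge Du Lac</a></td>
    <td><a href="Details.asp?cod_ann=1">Terrain nu</a></td>
    <td><a name="no-href">ignored</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
needle = "AnnoncesImmobilier.asp?rech_cod_pay="

# Function filter: h is None for <a> tags without an href, so `h and ...` skips them.
titles = []
for ad in soup.find_all("tr", {"class": "Tableau1"}):
    for link in ad.find_all("a", href=lambda h: h and needle in h):
        titles.append(link.get_text(strip=True))

# Regex variant: apply the pattern to href and escape the literal "?" and ".".
pattern = re.compile(re.escape(needle))
regex_titles = [a.get_text(strip=True) for a in soup.find_all("a", href=pattern)]

print(titles)
print(regex_titles)
```

Both approaches select the same link here; the plain substring test is usually easier to read when no real pattern matching is needed.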
CodePudding user response:
I'm interested in any 'a' under 'tr' tag as long as it starts or has : "AnnoncesImmobilier.asp?rech_cod_pay=" in the href attribute.
You can make your selection more specific with CSS selectors:
soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])')
To get a list of all the link texts, just iterate over the result set:
[row.a.text for row in soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])')]
Using set() you can filter the list to unique values:
set([row.a.text for row in soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])')])
Output
{'Hammam Lif', 'El Manar 2', 'El Menzah 8', 'Chotrana 1', 'Rades', 'Sousse Corniche', 'Cite De La Sant', 'Sousse', 'Bizerte', 'Ain Zaghouan', 'Hammamet', 'La Soukra', 'Riadh Landlous', 'El Menzah 5', 'Khezama Ouest', 'Montplaisir', 'Sousse Khezama', 'Hergla', 'El Ouerdia', 'Hammam Sousse', 'El Menzah 1', 'Cite Ennasr 2', 'Bab El Khadra'}
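Since the question asks for an href that "starts or has" the string, here is a small sketch contrasting the two attribute selectors: *= matches a substring anywhere in the value, while ^= matches only at the start. The HTML is a made-up sample and the variable names are illustrative; it assumes a Beautiful Soup version whose select() supports these selectors (4.7 or later, which uses soupsieve).

```python
from bs4 import BeautifulSoup

# Made-up sample rows (an assumption, not the real page markup).
html = """
<table>
  <tr class="Tableau1"><td><a href="AnnoncesImmobilier.asp?rech_cod_pay=TN">Rades</a></td></tr>
  <tr class="Tableau1"><td><a href="/fr/AnnoncesImmobilier.asp?rech_cod_pay=TN">Sousse</a></td></tr>
  <tr class="Tableau1"><td><a href="Other.asp?id=1">Skipped</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
needle = "AnnoncesImmobilier.asp?rech_cod_pay="

# href*= : the attribute value contains the string anywhere
contains = [a.get_text() for a in soup.select(f'tr.Tableau1 a[href*="{needle}"]')]

# href^= : the attribute value starts with the string
starts = [a.get_text() for a in soup.select(f'tr.Tableau1 a[href^="{needle}"]')]

print(contains)
print(starts)
```

Selecting the a elements directly, rather than the rows, also avoids relying on row.a, which returns the first a tag in the row whether or not it is the one that matched.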
To extract more than just the link text, you can collect every string in the row plus the href and load the result into a pandas DataFrame:
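import pandas as pd

data = []
for row in soup.select('tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])'):
    # All non-empty text fragments of the row, in document order
    d = list(row.stripped_strings)
    # Add the link target of the first <a> in the row
    d.append(row.a['href'])
    data.append(d)
pd.DataFrame(data)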
Output
Région | Nature | Type | Texte annonce | Prix | Modifiée | Link |
---|---|---|---|---|---|---|
Sousse Corniche | Location | App. 3 pièc | Magnifique appartement s2 fac | 1 000 | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=12114&rech_cod_loc=1211413 |
Riadh Landlous | Location | App. 4 pièc | S3 situé au 1ér étage à riadh | 850 | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=10201&rech_cod_loc=1020135 |
Khezama Ouest | Vente | App. 4 pièc | Magnifique s3 khzema pré | 250 000 | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=12112&rech_cod_loc=1211209 |
El Menzah 8 | Location | App. 1 pièc | Studio meublé manzah 8 vv | 600 | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=10201&rech_cod_loc=1020126 |
Hergla | Vente | App. 3 pièc | Appartement s 2 vue mer | 300 000 | 08/02/2022 | AnnoncesImmobilier.asp?rech_cod_pay=TN&rech_cod_vil=12105&rech_cod_loc=1210502 |
... |
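To get named columns like the ones in the table above, one option is to pair each row's strings with a list of headers. This is only a sketch: the column names are copied from the output above, the sample row is a shortened, made-up stand-in for the real page, and it assumes every row yields exactly six text fragments (zip silently drops anything beyond that).

```python
from bs4 import BeautifulSoup

# Column names taken from the output table; "Link" is added separately below.
COLUMNS = ["Région", "Nature", "Type", "Texte annonce", "Prix", "Modifiée"]

# Made-up sample row (an assumption, not the real page markup).
html = """
<table>
  <tr class="Tableau1">
    <td><a href="AnnoncesImmobilier.asp?rech_cod_pay=TN&amp;rech_cod_vil=12105">Hergla</a></td>
    <td>Vente</td>
    <td>App. 3 pièc</td>
    <td>Appartement s 2 vue mer</td>
    <td>300 000</td>
    <td>08/02/2022</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
selector = 'tr.Tableau1:has(a[href*="AnnoncesImmobilier.asp?rech_cod_pay="])'

records = []
for row in soup.select(selector):
    values = list(row.stripped_strings)   # non-empty text fragments, in order
    record = dict(zip(COLUMNS, values))   # pair each fragment with its header
    record["Link"] = row.a["href"]        # href of the first <a> in the row
    records.append(record)

print(records)
```

A list of dictionaries like records can be passed straight to pd.DataFrame(records), which uses the dictionary keys as column names.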