I have the following problem scraping a site. I have 3,700 pages, each with a person's email address, and I need to retrieve them. The problem is that the elements have no class names, and the XPath can differ between pages, because sometimes there is a phone number before the email, which breaks everything. I have tried several solutions with Selenium, but none of them worked. Can you please give me some advice on how to deal with this and how I can scrape them? Below are some examples of pages with different HTML structures. Thanks!
<div>
<div><i style="margin-right: 0.5rem;"></i><span style="unicode-bidi: bidi-override; direction: rtl;"> moc.ttobbanaej@naej</span></div>
<div><a href="http://JeanAbbott.com" target="_blank" rel="noopener noreferrer" style="overflow-wrap: normal; text-overflow: ellipsis; overflow: hidden;">JeanAbbott.com</a></div>
<div id="contactInfoWrap" style="margin-top: 10px;">
<div>Jean Abbott</div>
<div>
<div>5 Colonial Circle</div>
<div>Medicine Lake, MN 55441</div>
<div>US</div>
</div>
</div>
</div>
And another one
<div>
<div><i style="margin-right: 0.5rem;"></i>202-800-7057</div>
<div><i style="margin-right: 0.5rem;"></i><span style="unicode-bidi: bidi-override; direction: rtl;"> moc.tsiugnilde@ahsynal</span></div>
<div><a href="http://edlinguist.com/" target="_blank" rel="noopener noreferrer" style="overflow-wrap: normal; text-overflow: ellipsis; overflow: hidden;">edlinguist.com/</a></div>
<div id="contactInfoWrap" style="margin-top: 10px;">
<div>LaNysha Adams</div>
<div>
<div>80 M St SE</div>
<div>1st Floor</div>
<div>Washington, DC 20003</div>
<div>US</div>
</div>
</div>
</div>
The element that I need looks like this
<span style="unicode-bidi: bidi-override; direction: rtl;"> moc.ttobbanaej@naej</span>
CodePudding user response:
//div[contains(.,"@")]/span
The XPath expression above selects the desired HTML element:
<span style="unicode-bidi: bidi-override; direction: rtl;"> moc.tsiugnilde@ahsynal</span>
and the desired text node value is: moc.tsiugnilde@ahsynal
CodePudding user response:
It seems the email addresses are mirrored, and the style information accounts for that: unicode-bidi: bidi-override; direction: rtl;
meaning that moc.tsiugnilde@ahsynal
is displayed as lanysha@edlinguist.com.
So it may be better to just use this XPath:
//span[@style='unicode-bidi: bidi-override; direction: rtl;']
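Whichever expression is used to locate the span, the text it holds is still stored backwards, so it has to be reversed before it is usable. Below is a minimal sketch of that idea using only Python's standard library, so it runs without a browser. The names `RtlEmailParser` and `extract_emails` are made up for illustration, and the `sample` string is a trimmed copy of the two snippets from the question. It assumes every email sits in a `<span>` whose `style` contains `direction: rtl` and that such spans are not nested.

```python
from html.parser import HTMLParser


class RtlEmailParser(HTMLParser):
    """Collect the text of <span> elements styled with 'direction: rtl'."""

    def __init__(self):
        super().__init__()
        self._in_rtl_span = False
        self._buffer = []
        self.emails = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        style = dict(attrs).get("style") or ""
        if tag == "span" and "direction: rtl" in style:
            self._in_rtl_span = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_rtl_span:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_rtl_span:
            self._in_rtl_span = False
            mirrored = "".join(self._buffer).strip()
            # The page stores the address backwards; reverse it to restore it.
            self.emails.append(mirrored[::-1])


def extract_emails(html):
    """Return the un-mirrored text of every RTL-styled span in the HTML."""
    parser = RtlEmailParser()
    parser.feed(html)
    return parser.emails


# Trimmed copies of the two page fragments from the question.
sample = """
<div>
<div><i style="margin-right: 0.5rem;"></i><span style="unicode-bidi: bidi-override; direction: rtl;"> moc.ttobbanaej@naej</span></div>
<div><a href="http://JeanAbbott.com">JeanAbbott.com</a></div>
</div>
<div>
<div><i style="margin-right: 0.5rem;"></i>202-800-7057</div>
<div><i style="margin-right: 0.5rem;"></i><span style="unicode-bidi: bidi-override; direction: rtl;"> moc.tsiugnilde@ahsynal</span></div>
</div>
"""

print(extract_emails(sample))
```

Because the match is on the span's style rather than on its position, the extra phone-number row in the second fragment does not affect the result. With Selenium, the same reversal (`text.strip()[::-1]`) would be applied to the text of whatever span the XPath locates.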