I would like to scrape the name and address information that sits between the tag containing the text "Defendant" and the next heading tag.
My HTML structure is:
<hr>
<H5>Defendant/Respondent Information</H5>
<span >(Each Defendant/Respondent is displayed below)</span>
<table>
<tr>
<td><span >Party Type:</span></td><td><span >Defendant</span><span >Party No.:</span><span >1</span></td>
</tr>
</table>
<table>
<tr>
<td><span >Name:</span></td><td><span >Burrell, Marvin</span></td>
</tr>
</table>
<table>
<tr>
<td><span >Address:</span></td><td><span >33 N Ann St</span></td>
</tr>
<tr>
<td><span >City:</span></td><td><span >Baltimore</span><span >State:</span><span >MD</span><span >Zip Code:</span><span >21231</span></td>
</tr>
</table>
<hr>
<table>
<tr>
<td><span >Party Type:</span></td><td><span >Defendant</span><span >Party No.:</span><span >2</span></td>
</tr>
</table>
<table>
<tr>
<td><span >Name:</span></td><td><span >Burrell, Frances Ann</span></td>
</tr>
</table>
<table>
<tr>
<td><span >Address:</span></td><td><span >33 N Ann St</span></td>
</tr>
<tr>
<td><span >City:</span></td><td><span >Baltimore</span><span >State:</span><span >MD</span><span >Zip Code:</span><span >21231</span></td>
</tr>
</table>
<hr>
<H5>Related Persons Information</H5>
<span >(Each Related person is displayed below)</span>
<table>
<tr>
<td><span >Name:</span></td><td><span >Unwanted Name</span></td>
</tr>
</table>
<table>
<tr>
<td><span >Address:</span></td><td><span >33 N Ann St</span></td>
</tr>
<tr>
<td><span >City:</span></td><td><span >Unwanted City</span><span >State:</span><span >Unwanted city</span><span >Zip Code:</span><span >12345</span></td>
</tr>
</table>
<table></table>
<hr>
My current XPath captures the first occurrence of the name and address correctly, but when I need to extract multiple occurrences, it also scrapes the information that follows the unwanted h5 tags.
My current XPath is:
"//*[contains(text(),'Defendant')]//following-sibling::table//span[text()='Name:' or text()='Business or Organization Name:']/ancestor-or-self::td/following-sibling::td//text()"
I tried adding preceding-sibling and following-sibling conditions, but nothing gives my expected output.
My current output is:
names - [
Burrell, Marvin,
Burrell, Frances Ann,
Unwanted Name,
]
Expected output is,
[
Burrell, Marvin,
Burrell, Frances Ann,
]
Kindly help.
CodePudding user response:
Try this:
"//H5[contains(text(),'Defendant')]/following-sibling::table[not(preceding-sibling::H5[not(contains(text(),'Defendant'))])]/tr[td[1][span[text()[.='Name:' ]]]]/td[2]/span/text()"
It first selects the tables that do not have a preceding-sibling::h5 whose text() does not contain 'Defendant'. From those tables it then selects the tr whose first td meets your requirements, and takes the second td of that row.
There is no need for double slashes, which are bad for performance.
EDIT
Since there are more preceding-sibling::h5 elements than the example shows, this XPath will deal with that:
"//H5[contains(text(),'Defendant')]/following-sibling::table[preceding-sibling::H5[1][contains(text(),'Defendant')]]//tr[td[1][span[text()[.='Name:' ]]]]/td[2]/span/text()"
This will only select those tables whose first preceding-sibling::h5 (the nearest h5 before them) is the same h5 we were interested in.
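
As a rough way to check the second expression, here is a minimal sketch that runs it against a trimmed copy of the question's HTML. It assumes the third-party lxml library is the parser in use (the question does not say which one is). The names SAMPLE, NAME_XPATH and defendant_names are made up for illustration. Note one assumption that matters: lxml's HTML parser lowercases tag names, so the sketch writes h5 rather than H5; with other parsers the original casing may be the one that matches.

```python
from lxml import html  # third-party package (pip install lxml); an assumption, not named in the question

# Trimmed copy of the question's markup: two defendant tables, then an unwanted section.
SAMPLE = """
<div>
<H5>Defendant/Respondent Information</H5>
<table><tr><td><span>Name:</span></td><td><span>Burrell, Marvin</span></td></tr></table>
<hr>
<table><tr><td><span>Name:</span></td><td><span>Burrell, Frances Ann</span></td></tr></table>
<hr>
<H5>Related Persons Information</H5>
<table><tr><td><span>Name:</span></td><td><span>Unwanted Name</span></td></tr></table>
</div>
"""

# Same structure as the second XPath in the answer, with lowercase h5
# because lxml's HTML parser lowercases element names.
NAME_XPATH = (
    "//h5[contains(text(),'Defendant')]"
    "/following-sibling::table[preceding-sibling::h5[1][contains(text(),'Defendant')]]"
    "//tr[td[1][span[text()[.='Name:']]]]/td[2]/span/text()"
)


def defendant_names(page_source):
    """Return the Name values from tables whose nearest preceding h5 mentions 'Defendant'."""
    tree = html.fromstring(page_source)
    return tree.xpath(NAME_XPATH)


print(defendant_names(SAMPLE))
```

The key part is preceding-sibling::h5[1]: on a reverse axis, position 1 is the closest h5 before the table, so the table under "Related Persons Information" is filtered out even though the Defendant h5 also precedes it.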