Intention: I am extracting data about every country from Wikipedia, and I want my parser to be generic enough to work for all of them.
Say I am currently extracting GDP (PPP) for each country. On Wikipedia, these values sit inside an infobox table. The problem is that GDP (PPP) is split across 3 different rows of the table.
This is the structure:
<th scope="row" class="infobox-label">
<a href="/wiki/Gross_domestic_product" title="Gross domestic product">GDP</a> 
<style data-mw-deduplicate="TemplateStyles:r886047488">.mw-parser-output .nobold{font-weight:normal}</style>
<span class="nobold">(<a href="/wiki/Purchasing_power_parity" title="Purchasing power parity">PPP</a>)</span>
</th>
<td class="infobox-data">2020 estimate</td>
</tr>
<tr class="mergedrow">
<th scope="row" class="infobox-label">
<div class="ib-country-fake-li">• Total</div>
</th>
<td class="infobox-data"><img alt="Increase" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" decoding="async" title="Increase" width="11" height="11" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" data-file-width="300" data-file-height="300" /> $1.391 trillion<sup id="cite_ref-IMFWEOEG_10-0" class="reference"><a href="#cite_note-IMFWEOEG-10">[10]</a></sup> (<a href="/wiki/List_of_countries_by_GDP_(PPP)" title="List of countries by GDP (PPP)">20th</a>)</td>
</tr>
<tr class="mergedbottomrow">
<th scope="row" class="infobox-label">
<div class="ib-country-fake-li">• Per capita</div>
</th>
<td class="infobox-data"><img alt="Increase" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" decoding="async" title="Increase" width="11" height="11" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" data-file-width="300" data-file-height="300" /> $14,023<sup id="cite_ref-IMFWEOEG_10-1" class="reference"><a href="#cite_note-IMFWEOEG-10">[10]</a></sup> (<a href="/wiki/List_of_countries_by_GDP_(PPP)_per_capita" title="List of countries by GDP (PPP) per capita">92nd</a>)</td>
</tr>
Here is what I tried so far:
import requests
from bs4 import BeautifulSoup

site = "http://en.wikipedia.org/wiki/Brazil"
country = requests.get(site)
countryPage = BeautifulSoup(country.content, "html.parser")
infoBox = countryPage.find("table", class_="infobox ib-country vcard")
#find GDP PPP
tds = infoBox.select('th:-soup-contains("PPP") tr')
print(tds)
The problem: this code prints the GDP (PPP) row itself, not the row after it, even though I added ' tr' to the CSS selector.
Can anyone tell me what I did wrong? How do I select the table row after the one I found with the CSS selector?
CodePudding user response:
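A space in a CSS selector is the descendant combinator, so 'th:-soup-contains("PPP") tr' looks for a <tr> nested inside the matching <th>, not for the row that follows it. To select the next sibling <tr>, match the row with :has() and use the sibling combinator ~ (select_one returns only the first match):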
soup.select_one('tr:has(th:-soup-contains("PPP"))~tr')
Or, if you want both of them:
soup.select('tr:has(th:-soup-contains("PPP"))~tr')[:2]
To get the texts:
[x.text for x in soup.select('tr:has(th:-soup-contains("PPP"))~tr')[:2]]
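For anyone who wants to try the sibling selectors above without hitting Wikipedia, here is a minimal sketch on a trimmed-down copy of the question's HTML. The last row and the row_text helper are made up for illustration, and it assumes a soupsieve version recent enough to support :-soup-contains. It also shows how ~ (all following sibling rows) differs from + (only the row directly after).

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the infobox in the question; the values are copied
# from it, and the last row is a hypothetical filler row.
HTML = """
<table class="infobox ib-country vcard">
  <tr><th class="infobox-label">GDP <span class="nobold">(PPP)</span></th>
      <td class="infobox-data">2020 estimate</td></tr>
  <tr class="mergedrow"><th class="infobox-label">• Total</th>
      <td class="infobox-data">$1.391 trillion</td></tr>
  <tr class="mergedbottomrow"><th class="infobox-label">• Per capita</th>
      <td class="infobox-data">$14,023</td></tr>
  <tr><th class="infobox-label">Unrelated</th>
      <td class="infobox-data">placeholder</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
header = 'tr:has(th:-soup-contains("PPP"))'  # the row whose <th> mentions PPP

following = soup.select(f"{header} ~ tr")  # every later sibling row
adjacent = soup.select(f"{header} + tr")   # only the row immediately after


def row_text(tr):
    """Join the cell texts of one row with single spaces (illustrative helper)."""
    return tr.get_text(" ", strip=True)


# Keep only the two rows that belong to GDP (PPP).
gdp_rows = [row_text(tr) for tr in following[:2]]
print(gdp_rows)
print(len(following), len(adjacent))
```

Because ~ keeps matching until the end of the table, the [:2] slice is what limits the result to the "Total" and "Per capita" rows.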
CodePudding user response:
I don't think you can achieve what you want with CSS selectors. One way or another, you'll have to store the rows or get the indices of the rows. If you turn the select result into a generator, you can use next:
trs = (tr for tr in soup.select('tr'))
for tr in trs:
    if 'PPP' in tr.text:
        # next() advances the same generator, so these are the two rows after the match
        print(next(trs).text)
        print(next(trs).text)
>>> • Total $3.328 trillion[8] (8th)
>>> • Per capita $15,642[8] (84th)
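One caveat with calling next() directly: it raises StopIteration if the matching row is the last one. Below is a rough sketch of the same shared-iterator idea using itertools.islice, which simply stops early instead. The rows_after function is a hypothetical helper, and the sample list stands in for the row texts (the last entry is filler).

```python
from itertools import islice


def rows_after(rows, needle, count=2):
    """Return up to `count` items that follow the first item containing `needle`."""
    it = iter(rows)  # one shared iterator: the loop and islice both advance it
    for row in it:
        if needle in row:
            # islice stops quietly if fewer than `count` items remain
            return list(islice(it, count))
    return []  # needle never found


# Row texts as they might come out of the infobox; values are from the
# question's HTML, the last entry is hypothetical filler.
rows = [
    "GDP (PPP) 2020 estimate",
    "• Total $1.391 trillion",
    "• Per capita $14,023",
    "unrelated row",
]
print(rows_after(rows, "PPP"))
```

With BeautifulSoup you could feed it something like (tr.get_text(" ", strip=True) for tr in soup.select("tr")).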