I am working on extracting data about all countries from Wikipedia, and I want my parser to be generic enough to work for every country. For example, say I am extracting GDP (PPP) for all countries. On Wikipedia, this value is placed inside the infobox table. The problem is that GDP (PPP) is split across three different rows of the table. This is the structure:
<th scope="row" class="infobox-label">
<a href="/wiki/Gross_domestic_product" title="Gross domestic product">GDP</a> 
<style data-mw-deduplicate="TemplateStyles:r886047488">.mw-parser-output .nobold{font-weight:normal}</style>
<span class="nobold">(<a href="/wiki/Purchasing_power_parity" title="Purchasing power parity">PPP</a>)</span>
</th>
<td class="infobox-data">2020 estimate</td>
</tr>
<tr class="mergedrow">
<th scope="row" class="infobox-label">
<div class="ib-country-fake-li">• Total</div>
</th>
<td class="infobox-data"><img alt="Increase" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" decoding="async" title="Increase" width="11" height="11" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" data-file-width="300" data-file-height="300" /> $1.391 trillion<sup id="cite_ref-IMFWEOEG_10-0" class="reference"><a href="#cite_note-IMFWEOEG-10">[10]</a></sup> (<a href="/wiki/List_of_countries_by_GDP_(PPP)" title="List of countries by GDP (PPP)">20th</a>)</td>
</tr>
<tr class="mergedbottomrow">
<th scope="row" class="infobox-label">
<div class="ib-country-fake-li">• Per capita</div>
</th>
<td class="infobox-data"><img alt="Increase" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" decoding="async" title="Increase" width="11" height="11" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" data-file-width="300" data-file-height="300" /> $14,023<sup id="cite_ref-IMFWEOEG_10-1" class="reference"><a href="#cite_note-IMFWEOEG-10">[10]</a></sup> (<a href="/wiki/List_of_countries_by_GDP_(PPP)_per_capita" title="List of countries by GDP (PPP) per capita">92nd</a>)</td>
</tr>
Here is what I have tried so far:
import requests
from bs4 import BeautifulSoup

site = "http://en.wikipedia.org/wiki/Brazil"
country = requests.get(site)
countryPage = BeautifulSoup(country.content, "html.parser")
infoBox = countryPage.find("table", class_="infobox ib-country vcard")
# find GDP (PPP)
tds = infoBox.select('th:-soup-contains("PPP") tr')
print(tds)
The problem is that this code prints the GDP (PPP) row itself and not the one after it, despite the ' tr' in the CSS selector. Can anyone tell me what I did wrong? How do I select the table row after the one I found with the CSS selector?
On a side note, if you can suggest an easier extraction mechanism than what I am currently doing, that would be very useful.
CodePudding user response:
I don't think you can achieve what you want with CSS selectors alone. One way or another, you'll have to store the rows or get their indices. If you turn the select result into a generator, you can use next:
trs = (tr for tr in soup.select('tr'))
for tr in trs:
    if 'PPP' in tr.text:
        print(next(trs).text)
        print(next(trs).text)
>>> • Total $3.328 trillion[8] (8th)
>>> • Per capita $15,642[8] (84th)
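The same iterator idea can be wrapped in a small helper that does not raise StopIteration when the table ends early. This is only a sketch: the function name text_of_following_rows and the trimmed-down HTML string are made up for illustration (the values are copied from the question's snippet), and it assumes the label row and its follow-up rows appear in document order.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the infobox from the question.
HTML = """
<table class="infobox">
<tr><th class="infobox-label">GDP (PPP)</th><td class="infobox-data">2020 estimate</td></tr>
<tr class="mergedrow"><th class="infobox-label">• Total</th><td class="infobox-data">$1.391 trillion</td></tr>
<tr class="mergedbottomrow"><th class="infobox-label">• Per capita</th><td class="infobox-data">$14,023</td></tr>
<tr><th class="infobox-label">GDP (nominal)</th><td class="infobox-data">2020 estimate</td></tr>
</table>
"""


def text_of_following_rows(soup, keyword, count=2):
    """Return the text of up to `count` rows after the first row containing `keyword`."""
    rows = iter(soup.select("tr"))
    for row in rows:
        if keyword in row.get_text():
            following = []
            for _ in range(count):
                # next() with a default avoids StopIteration at the end of the table.
                nxt = next(rows, None)
                if nxt is None:
                    break
                following.append(nxt.get_text(" ", strip=True))
            return following
    return []  # keyword not found in any row


soup = BeautifulSoup(HTML, "html.parser")
ppp_rows = text_of_following_rows(soup, "PPP")
missing_rows = text_of_following_rows(soup, "no such label")
print(ppp_rows)
```

Returning an empty list for a missing label makes it easier to run the same helper over many country pages, where some infoboxes may lack the field.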
CodePudding user response:
To select the next sibling <tr>, you can go with:
soup.select_one('tr:has(th:-soup-contains("PPP"))~tr')
Or, if you want both of them:
soup.select('tr:has(th:-soup-contains("PPP"))~tr')[:2]
To get the texts:
[x.text for x in soup.select('tr:has(th:-soup-contains("PPP"))~tr')[:2]]
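To try the sibling selector without fetching Wikipedia, here is a minimal sketch against an inline snippet. The helper name rows_after_label and the shortened HTML are assumptions made for illustration (values taken from the question). Note that ~ matches every later sibling row, not just the next one, which is why the result is sliced.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the infobox from the question.
HTML = """
<table class="infobox">
<tr><th class="infobox-label">GDP (PPP)</th><td class="infobox-data">2020 estimate</td></tr>
<tr class="mergedrow"><th class="infobox-label">• Total</th><td class="infobox-data">$1.391 trillion</td></tr>
<tr class="mergedbottomrow"><th class="infobox-label">• Per capita</th><td class="infobox-data">$14,023</td></tr>
<tr><th class="infobox-label">GDP (nominal)</th><td class="infobox-data">2020 estimate</td></tr>
</table>
"""


def rows_after_label(soup, label, count=2):
    """Return (label, value) pairs for the first `count` rows after the row whose <th> contains `label`."""
    # `~` is the general sibling combinator: it matches all later sibling rows,
    # so the slice keeps only the rows that belong to this field.
    selector = f'tr:has(th:-soup-contains("{label}")) ~ tr'
    rows = soup.select(selector)[:count]
    return [
        (row.th.get_text(strip=True), row.td.get_text(strip=True))
        for row in rows
    ]


soup = BeautifulSoup(HTML, "html.parser")
gdp_ppp = rows_after_label(soup, "PPP")
print(gdp_ppp)
```

The slice length of 2 is specific to fields such as GDP (PPP) that have a Total and a Per capita row; other infobox fields may need a different count.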