Web Scraping Wikipedia

Time:10-11

I need to extract only the continent (North America) from a Wikipedia page, given its URL. In the code below the country, in this case "Guatemala", will later be replaced by a parameter in Power BI. However, I am getting the whole <a> tag instead of just its text. How can I get only the text?

import requests as rq
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import re

url = 'https://en.wikipedia.org/wiki/Geography_of_Guatemala'
page = rq.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
res = soup.find_all('td', class_='infobox-data')
df = pd.DataFrame(res)
df = df.to_numpy()
df = str(df[0])
print(df)
print(re.search('\">(.*?)\</a>', df).group(1))

This is the data frame:

[<td ><a href="/wiki/North_America" title="North America">North America</a></td>]

and this is the re.search:

<a href="/wiki/North_America" title="North America">North America

CodePudding user response:

I don't know whether this is the best solution, but it works and follows a simple logic. I assume you want to change the country and read off the continent.

So I iterate over all the elements returned by find_all, collect only their text values in a new list, and take element 0 (the first one):

import requests as rq
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Geography_of_Guatemala'
page = rq.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
res = soup.find_all('td', class_='infobox-data')
# Keep only the text of each cell, dropping the surrounding tags
info_list = [cell.text for cell in res]
print(info_list[0])

Alternatively, since you only want the first value, you can use BeautifulSoup's find method, which returns only the first match:

res = soup.find('td', class_='infobox-data')
print(res.text)
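Since the question mentions turning the country into a parameter, the same idea could be wrapped in small functions. This is only a sketch: the names geography_url and first_infobox_value are made up for illustration, the "Geography_of_<country>" URL pattern is assumed to hold for other countries, and it is also an assumption that the first infobox-data cell is always the continent. The parsing is run here on a hard-coded fragment shaped like the cell in the question, so no request is made.

```python
from urllib.parse import quote

from bs4 import BeautifulSoup


def geography_url(country):
    """Build the article URL for a country (assumed naming pattern)."""
    # Wikipedia titles use underscores in place of spaces.
    return "https://en.wikipedia.org/wiki/Geography_of_" + quote(country.replace(" ", "_"))


def first_infobox_value(html):
    """Return the text of the first infobox data cell, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td", class_="infobox-data")
    return cell.get_text(strip=True) if cell else None


# Offline check on a fragment shaped like the cell shown in the question.
sample = (
    '<table><tr><td class="infobox-data">'
    '<a href="/wiki/North_America" title="North America">North America</a>'
    "</td></tr></table>"
)
print(geography_url("Guatemala"))
print(first_infobox_value(sample))
```

To use it on the live page, pass rq.get(geography_url("Guatemala")).text to first_infobox_value. Returning None instead of raising makes it easier to handle a page that has no infobox.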

CodePudding user response:

To get the name of the continent, you can use, for example, a CSS selector:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Geography_of_Guatemala"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

continent = soup.select_one("th:-soup-contains(Continent) + td").text
print(continent)

Prints:

North America

th:-soup-contains(Continent) + td

selects the <td> tag that immediately follows a <th> tag containing the text "Continent". The + is the CSS adjacent-sibling combinator.
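To see what the selector does without touching the network, it can be tried on a small hand-written table. This is a sketch under assumptions: the HTML below is a simplified stand-in for the real infobox, continent_from_html is a hypothetical helper name, and :-soup-contains needs a reasonably recent soupsieve (the selector library that BeautifulSoup uses). The adjacent-sibling combinator + picks the <td> that directly follows the matching <th>.

```python
from bs4 import BeautifulSoup


def continent_from_html(html):
    """Return the text of the <td> right after a <th> containing 'Continent'."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.select_one('th:-soup-contains("Continent") + td')
    # select_one returns None when nothing matches, so guard before reading text.
    return cell.get_text(strip=True) if cell else None


# Simplified stand-in for the infobox rows (not the real page markup).
sample = """
<table class="infobox">
  <tr><th>Continent</th><td><a href="/wiki/North_America">North America</a></td></tr>
  <tr><th>Region</th><td>Central America</td></tr>
</table>
"""

print(continent_from_html(sample))
```

Unlike taking the first infobox-data cell, this does not depend on the continent row being first in the infobox. It only needs a header cell whose text contains "Continent".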
