I need to extract only the continent (North America) from a Wikipedia page by URL (in the code below the country, here "Guatemala", will be replaced by a parameter in Power BI), but I am getting the whole <a> tag. How can I do that?
import requests as rq
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import re
url = 'https://en.wikipedia.org/wiki/Geography_of_Guatemala'
page = rq.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
res = soup.find_all('td', class_='infobox-data')
df = pd.DataFrame(res)
df = df.to_numpy()
df = str(df[0])
print(df)
print(re.search('\">(.*?)\</a>', df).group(1))
This is the data frame:
[<td ><a href="/wiki/North_America" title="North America">North America</a></td>]
and this is the re.search output:
<a href="/wiki/North_America" title="North America">North America
CodePudding user response:
I don't know if it's the best solution, but it works and follows some logic. I assume you want to change the country and get the continent.
So I iterate over all the results of your find_all call, append only the text values to a new list, and take element 0 (the first one):
url = 'https://en.wikipedia.org/wiki/Geography_of_Guatemala'
page = rq.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
res = soup.find_all('td', class_='infobox-data')
info_list = []
for i in res:
    info_list.append(i.text)
print(info_list[0])
Alternatively, since you only want the first value, you can use BeautifulSoup's find function instead:
res = soup.find('td', class_='infobox-data')
print(res.text)
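Since the question mentions turning the country into a parameter, here is a minimal sketch of how the find approach above could be wrapped in functions. The helper names geography_url and first_infobox_value and the SAMPLE_HTML snippet are made up for illustration, and the 'Geography_of_<Country>' URL pattern is an assumption taken from the single example URL; it may not hold for every country.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the infobox of a "Geography of ..." page.
SAMPLE_HTML = """
<table class="infobox">
  <tr><th class="infobox-label">Continent</th>
      <td class="infobox-data"><a href="/wiki/North_America" title="North America">North America</a></td></tr>
  <tr><th class="infobox-label">Region</th>
      <td class="infobox-data"><a href="/wiki/Central_America">Central America</a></td></tr>
</table>
"""


def geography_url(country):
    """Build the article URL, assuming the 'Geography_of_<Country>' title pattern."""
    return "https://en.wikipedia.org/wiki/Geography_of_" + country.replace(" ", "_")


def first_infobox_value(html):
    """Return the text of the first infobox-data cell, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td", class_="infobox-data")
    # get_text() drops the surrounding <a> tag and keeps only its text
    return cell.get_text(strip=True) if cell is not None else None


print(geography_url("Guatemala"))
print(first_infobox_value(SAMPLE_HTML))
```

With a live page you would pass rq.get(geography_url(country)).text instead of SAMPLE_HTML. Note that this relies on the continent being the first infobox-data cell, which is true for the sample but is not guaranteed for every article.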
CodePudding user response:
To get the name of the continent you can use, for example, a CSS selector:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Geography_of_Guatemala"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
continent = soup.select_one("th:-soup-contains(Continent) + td").text
print(continent)
Prints:
North America
th:-soup-contains(Continent) + td selects the <td> tag that immediately follows a <th> tag containing the text "Continent".
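The description above (a <td> whose previous tag is the matching <th>) corresponds to the CSS adjacent sibling combinator "+". As a small offline sketch, the snippet below compares it with the descendant combinator (a plain space) on a simplified, made-up infobox row; the real Wikipedia markup carries more attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified infobox rows; the real page has more markup.
html = """
<table>
  <tr><th>Continent</th><td><a href="/wiki/North_America">North America</a></td></tr>
  <tr><th>Region</th><td>Central America</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# "+" matches the <td> that immediately follows the matching <th>.
sibling = soup.select_one("th:-soup-contains(Continent) + td")

# A space means "descendant": a <td> nested inside the <th>,
# which does not exist in this markup, so nothing is selected.
descendant = soup.select_one("th:-soup-contains(Continent) td")

print(sibling.get_text(strip=True))
print(descendant)
```

Because the lookup is keyed on the "Continent" label rather than on position, it should keep working if the order of the infobox rows differs between articles.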