I have extracted some info from the first column of https://www.peakbagger.com/list.aspx?lid=5651, e.g. https://www.peakbagger.com/peak.aspx?pid=10882:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.peakbagger.com/peak.aspx?pid=10882'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
a = soup.select("td")
a
I would like to retrieve only the latitude and longitude, which is 35.360638, 138.727347, from the output I get. It appears as:
E<br/>35.360638, 138.727347
Also, is there a better way to retrieve all the latitudes and longitudes from the links of https://www.peakbagger.com/list.aspx?lid=5651 other than doing it one by one?
Thanks
CodePudding user response:
This answer is very specific to your problem. The issue here is the br tags occurring inside the td tags. The etree module (part of lxml) lets you access the text that follows a tag (known as its tail). This code will print the values you've shown as your desired output.
import requests
from lxml import etree
with requests.Session() as session:
    r = session.get('https://www.peakbagger.com/peak.aspx?pid=10882')
    r.raise_for_status()
tree = etree.HTML(r.text)
print(' '.join(tree.xpath('//table[@class="gray"][1]/*//br')[1].tail.split()[:2]))
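To see why the tail is needed, here is a minimal offline sketch: the markup is a stand-in for the relevant part of the page (an assumption, not the real page source), showing that the text after a `<br/>` is stored on the br element's `.tail`, not in any element's `.text`.

```python
from lxml import etree

# Stand-in for the peakbagger td: text after the <br/> lives in br.tail.
snippet = ('<table class="gray"><tr><td>35&#176; 21\' 38" N; '
           '138&#176; 43\' 38" E<br/>35.360638, 138.727347 (Dec Deg)'
           '</td></tr></table>')
tree = etree.HTML(snippet)
br = tree.xpath('//br')[0]
# br.tail is '35.360638, 138.727347 (Dec Deg)'; keep the first two tokens
coords = ' '.join(br.tail.split()[:2])
print(coords)  # 35.360638, 138.727347
```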
CodePudding user response:
As Brutus mentioned, it is very specific; if you don't want to use etree, this could be an alternative:
- find() the td with the string Latitude/Longitude (WGS84)
- findNext() its next td and grab its contents
- replace the , and split on whitespace
- slicing the result to the first two elements gives you the list with lat and long
data = soup.find('td', string='Latitude/Longitude (WGS84)')\
.findNext('td')\
.contents[2]\
.replace(',','')\
.split()[:2]
data
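The chain can be checked offline against a stripped-down stand-in for the table row (the markup below is an assumption for illustration, not the real page source): `contents` on the second td is a list of the text before the `<br/>`, the `<br/>` tag itself, and the text after it, so index 2 picks out the decimal pair.

```python
from bs4 import BeautifulSoup

# Offline stand-in for the relevant table row (assumed markup).
html = ('<tr><td>Latitude/Longitude (WGS84)</td>'
        '<td>35&#176; 21\' 38" N; 138&#176; 43\' 38" E<br/>'
        '35.360638, 138.727347 (Dec Deg)</td></tr>')
soup = BeautifulSoup(html, 'html.parser')
td = soup.find('td', string='Latitude/Longitude (WGS84)').findNext('td')
# td.contents is [text-before-br, <br/>, text-after-br]
lat, lon = td.contents[2].replace(',', '').split()[:2]
print(lat, lon)  # 35.360638 138.727347
```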
EDIT
You have a list of urls and loop over it. To be considerate of the website and avoid being banned, fetch the pages with some delay (time.sleep()).
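Regarding the "other than doing it one by one" part: instead of typing the urls, you could also collect them from the list page itself. A minimal offline sketch, under the assumption that the peak links on list.aspx contain peak.aspx?pid= (the snippet below stands in for the real listing markup):

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Assumed stand-in for the list.aspx table; the real structure may differ.
listing = ('<table><tr><td><a href="peak.aspx?pid=10882">Fuji-san</a></td></tr>'
           '<tr><td><a href="peak.aspx?pid=10866">Kita-dake</a></td></tr></table>')
soup = BeautifulSoup(listing, 'html.parser')
base = 'https://www.peakbagger.com/'
# Substring CSS selector: every anchor whose href points at a peak page
urls = [urljoin(base, a['href'])
        for a in soup.select('a[href*="peak.aspx?pid="]')]
print(urls)
```

With the real page you would fetch https://www.peakbagger.com/list.aspx?lid=5651 first and run the same selector over it.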
import time
import requests
from bs4 import BeautifulSoup
urls = ['https://www.peakbagger.com/peak.aspx?pid=10882',
'https://www.peakbagger.com/peak.aspx?pid=10866',
'https://www.peakbagger.com/peak.aspx?pid=10840',
'https://www.peakbagger.com/peak.aspx?pid=10868',
'https://www.peakbagger.com/peak.aspx?pid=10832']
data = {}
for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    ll = soup.find('td', string='Latitude/Longitude (WGS84)')\
             .findNext('td')\
             .contents[2]\
             .replace(',', '')\
             .split()[:2]
    data[soup.select_one('h1').get_text()] = {
        'url': url,
        'lat': ll[0],
        'long': ll[1]
    }
    time.sleep(3)
data
Output
{'Fuji-san, Japan': {'url': 'https://www.peakbagger.com/peak.aspx?pid=10882',
'lat': '35.360638',
'long': '138.727347'},
'Kita-dake, Japan': {'url': 'https://www.peakbagger.com/peak.aspx?pid=10866',
'lat': '35.674537',
'long': '138.238833'},
'Hotaka-dake, Japan': {'url': 'https://www.peakbagger.com/peak.aspx?pid=10840',
'lat': '36.289203',
'long': '137.647986'},
'Aino-dake, Japan': {'url': 'https://www.peakbagger.com/peak.aspx?pid=10868',
'lat': '35.646037',
'long': '138.228292'},
'Yariga-take, Japan': {'url': 'https://www.peakbagger.com/peak.aspx?pid=10832',
'lat': '36.34198',
'long': '137.647625'}}