How to extract latitude and longitude from a web page

Time:09-29

I am extracting some info from https://www.peakbagger.com/list.aspx?lid=5651, starting with the first link in its first column, https://www.peakbagger.com/peak.aspx?pid=10882:


from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.peakbagger.com/peak.aspx?pid=10882'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

# dump all table cells to locate the one holding the coordinates
cells = soup.select('td')
cells

From the output I get, I only want to retrieve the latitude and longitude, which are 35.360638, 138.727347.

It is in

E<br/>35.360638, 138.727347 

Also, is there a better way to retrieve all the latitudes and longitudes from the links on https://www.peakbagger.com/list.aspx?lid=5651 other than doing it one by one?

Thanks

CodePudding user response:

This answer is very specific to your problem. The issue here is with the br tags occurring inside td tags. The etree module (part of lxml) allows you to access text following a tag (a.k.a. its tail). This code will print the values you've shown as being your desired output.

import requests
from lxml import etree

with requests.Session() as session:
    r = session.get('https://www.peakbagger.com/peak.aspx?pid=10882')
    r.raise_for_status()
    tree = etree.HTML(r.text)
    print(' '.join(tree.xpath('//table[@class="gray"][1]/*//br')[1].tail.split()[:2]))
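To see what "tail" means in isolation, here is a toy illustration on a static snippet (the HTML below is a made-up stand-in, not the actual page): text that follows a tag inside its parent is attached to that tag as its tail, not to the parent element.

```python
from lxml import etree

# Made-up fragment mimicking the coordinate cell's structure
tree = etree.HTML('<div>E<br/>35.360638, 138.727347</div>')
br = tree.xpath('//div/br')[0]

# The text after the <br/> belongs to the br element's .tail
print(br.tail)                          # '35.360638, 138.727347'
print(' '.join(br.tail.split()[:2]))    # '35.360638, 138.727347'
```

The answer above applies exactly this idea, just with an XPath that targets the right br inside the page's first gray table.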

CodePudding user response:

As Brutus mentioned, it is very specific; if you don't want to use etree, this could be an alternative:

  • find() the td with the string Latitude/Longitude (WGS84)
  • findNext() to get the following td
  • grab its contents
  • replace the comma and split on whitespace
  • slicing the result to its first two elements gives a list with lat and long


data = soup.find('td', string='Latitude/Longitude (WGS84)')\
            .findNext('td')\
            .contents[2]\
            .replace(',','')\
            .split()[:2]

data
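The chain can be checked against a static stand-in for the relevant row; the HTML below is an assumption about the page's layout, with find_next as the PEP 8 alias of findNext:

```python
from bs4 import BeautifulSoup

# Assumed layout: the label cell is followed by a cell whose third
# node (after the degree string and the <br/>) holds the coordinates
html = ('<table><tr><td>Latitude/Longitude (WGS84)</td>'
        '<td>35&#176; 21&#39; 38&quot; N; 138&#176; 43&#39; 38&quot; E<br/>'
        '35.360638, 138.727347 (Dec Deg)</td></tr></table>')

soup = BeautifulSoup(html, 'html.parser')
data = soup.find('td', string='Latitude/Longitude (WGS84)') \
           .find_next('td').contents[2] \
           .replace(',', '').split()[:2]
print(data)  # ['35.360638', '138.727347']
```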

EDIT

If you have a list of urls, simply loop over it. To be considerate of the website and avoid getting banned, crawl the pages with some delay (time.sleep()).

import time
import requests
from bs4 import BeautifulSoup

urls = ['https://www.peakbagger.com/peak.aspx?pid=10882',
        'https://www.peakbagger.com/peak.aspx?pid=10866',
        'https://www.peakbagger.com/peak.aspx?pid=10840',
        'https://www.peakbagger.com/peak.aspx?pid=10868',
        'https://www.peakbagger.com/peak.aspx?pid=10832']

data = {}

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    ll = soup.find('td', string='Latitude/Longitude (WGS84)')\
             .findNext('td')\
             .contents[2]\
             .replace(',', '')\
             .split()[:2]

    data[soup.select_one('h1').get_text()] = {
        'url': url,
        'lat': ll[0],
        'long': ll[1]
    }

    time.sleep(3)

data

Output

{'Fuji-san, Japan': {'url': 'https://www.peakbagger.com/peak.aspx?pid=10882',
  'lat': '35.360638',
  'long': '138.727347'},
 'Kita-dake, Japan': {'url': 'https://www.peakbagger.com/peak.aspx?pid=10866',
  'lat': '35.674537',
  'long': '138.238833'},
 'Hotaka-dake, Japan': {'url': 'https://www.peakbagger.com/peak.aspx?pid=10840',
  'lat': '36.289203',
  'long': '137.647986'},
 'Aino-dake, Japan': {'url': 'https://www.peakbagger.com/peak.aspx?pid=10868',
  'lat': '35.646037',
  'long': '138.228292'},
 'Yariga-take, Japan': {'url': 'https://www.peakbagger.com/peak.aspx?pid=10832',
  'lat': '36.34198',
  'long': '137.647625'}}