How to scrape Wikipedia GPS latitude/longitude in HTML DOM?

Time:11-30

I have been wondering how to scrape information from Wikipedia. For example, I have a list of world cities and want to obtain their approximate latitude and longitude. Take Miami as an example. When I run curl https://en.wikipedia.org/wiki/Miami | grep -E '(latitude|longitude)', the HTML contains a pair of tags like the one below.

<span class="latitude">25°46′31″N</span> <span class="longitude">80°12′31″W</span>

I know I could extract it with a regular expression, but my regex skills are very poor. Can someone help me with this?

CodePudding user response:

With xidel and XPath:

$ xidel -e '
    concat(
        (//span[@class="latitude"]/text())[position()=1],
        " ",
        (//span[@class="longitude"]/text())[position()=1]
    )
' 'https://en.wikipedia.org/wiki/Miami'

Output

25°46′31″N 80°12′31″W

If you need to convert the coordinates to decimal degrees:

$ xidel -se '(//span[@class="latitude"]/text())[position()=1] | (//span[@class="longitude"]/text())[position()=1]' 'https://en.wikipedia.org/wiki/Miami' |
    perl -pe 's|^(\d+)\D+(\d+)\D+(\d+).*|$1+($2/60)+($3/60)/60|e'
25.7752777777778
80.2086111111111
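
Note that the Perl one-liner above discards the hemisphere letter, so the western longitude comes out positive. As a minimal sketch of the same arithmetic with the sign applied, assuming the input is always degrees, optional minutes, optional seconds, and a trailing N/S/E/W (the function name dms_to_decimal is illustrative, not from any library):

```python
import re


def dms_to_decimal(dms: str) -> float:
    """Convert a string such as '25°46′31″N' to signed decimal degrees."""
    text = dms.strip()
    if not text:
        raise ValueError("empty coordinate string")

    # Pull out up to three numeric fields: degrees, minutes, seconds.
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    hemisphere = text[-1].upper()
    if not 1 <= len(numbers) <= 3 or hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unrecognised coordinate: {dms!r}")

    # Missing minutes or seconds are treated as zero.
    parts = [float(n) for n in numbers] + [0.0] * (3 - len(numbers))
    degrees, minutes, seconds = parts
    value = degrees + minutes / 60 + seconds / 3600

    # South and west are negative by convention.
    return -value if hemisphere in ("S", "W") else value


if __name__ == "__main__":
    print(dms_to_decimal("25°46′31″N"))
    print(dms_to_decimal("80°12′31″W"))
```

The regular expression here only picks numbers out of a short, already-extracted string; it is not used to parse the HTML itself.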

Or

saxon-lint --html --xpath '<XPATH EXP>' <URL>

If you prefer more widely known tools:

curl -s 'https://en.wikipedia.org/wiki/Miami' > Miami.html
xmlstarlet format -H Miami.html 2>/dev/null | sponge Miami.html
xmlstarlet sel -t -v '<XPATH EXP>' Miami.html

As a side note, regular expressions are not the right tool for parsing HTML; use a real HTML/XML parser, as in the commands above.
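
To illustrate the parser-based approach without installing extra tools, here is a rough sketch using Python's standard html.parser module. It assumes the coordinates sit in span elements whose class attribute contains latitude or longitude, as in the snippet from the question; the names CoordinateParser and extract_coordinates are made up for this example.

```python
from html.parser import HTMLParser


class CoordinateParser(HTMLParser):
    """Collect the first text found in span.latitude and span.longitude."""

    def __init__(self):
        super().__init__()
        self.coords = {}      # class name -> first text seen
        self._current = None  # class of the span currently being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in ("latitude", "longitude"):
            # Only the first occurrence of each class is recorded.
            if name in classes and name not in self.coords:
                self._current = name

    def handle_data(self, data):
        if self._current:
            self.coords[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        # An empty span must not capture text that follows it.
        if tag == "span":
            self._current = None


def extract_coordinates(html):
    parser = CoordinateParser()
    parser.feed(html)
    parser.close()
    return parser.coords.get("latitude"), parser.coords.get("longitude")


if __name__ == "__main__":
    sample = ('<span class="latitude">25°46′31″N</span> '
              '<span class="longitude">80°12′31″W</span>')
    print(extract_coordinates(sample))
```

The page source could be fetched with curl as shown earlier and passed to extract_coordinates as a string.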
