Use beautifulsoup to download href links

I'm looking to download href links using beautifulsoup4, Python 3, and the requests library.

This is the code that I have now. I thought it would be tough to use regex in this situation, but I'm not sure if this can be done using beautifulsoup4. I have to download all of the shapefiles from the grid and am looking to automate this task. Thank you!

URL: https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads

import requests
from bs4 import BeautifulSoup
import re


URL = 'https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads'

page =  requests.get(URL)

soup = BeautifulSoup(page.content,'html.parser')




results = re.findall(r'<a[^>]* href="([^"]*)"', page)

print(results)

CodePudding user response：

Those files are all linked from area tags, so I would simply select those:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads')
soup = bs(r.content, 'lxml')
files = ['https://earth-info.nga.mil' + i['href'] for i in soup.select('area')]
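
To go from that list of links to files on disk, something like the following might work. It is only a sketch under assumptions: the helper names extract_file_urls and download_all and the shapefiles output directory are made up for illustration, and it assumes each href path ends in a distinct file name. urljoin is used instead of string concatenation so that relative and root-relative hrefs both resolve against the page URL.

```python
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def extract_file_urls(html, base_url):
    """Return absolute URLs for every <area> tag that carries an href."""
    soup = BeautifulSoup(html, 'html.parser')
    return [urljoin(base_url, tag['href']) for tag in soup.select('area[href]')]


def download_all(urls, out_dir='shapefiles'):
    """Save each URL into out_dir (a hypothetical directory name)."""
    os.makedirs(out_dir, exist_ok=True)
    for url in urls:
        # Assumption: the last path segment is a usable, unique file name.
        name = os.path.basename(urlparse(url).path) or 'download'
        resp = requests.get(url, timeout=60)
        resp.raise_for_status()
        with open(os.path.join(out_dir, name), 'wb') as fh:
            fh.write(resp.content)
```

With r being the response from the snippet above, download_all(extract_file_urls(r.content, r.url)) would then fetch every linked file.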

CodePudding user response：

re.findall needs a string, but page is a requests Response object, so the call fails. Use page.text to get the body as a string, then search it for all a tags with the regex.

Instead of:

results = re.findall(r'<a[^>]* href="([^"]*)"', page)

Use:

results = re.findall(r'<a[^>]* href="([^"]*)"', page.text)
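
For comparison, the same hrefs can also be collected from the parsed soup instead of with a regex. The sketch below runs both approaches on a made-up HTML snippet (the sample markup is an assumption for illustration, not taken from the NGA page) to show where they can differ.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup for illustration only.
sample_html = '''
<a class="x" href="/files/a.zip">A</a>
<a href='/files/b.zip'>B</a>
<a name="top">no link</a>
'''

# Regex from the answer: it only matches href values in double quotes.
regex_links = re.findall(r'<a[^>]* href="([^"]*)"', sample_html)

# Parser approach: href=True keeps only <a> tags that have an href attribute,
# regardless of how the attribute value was quoted.
soup = BeautifulSoup(sample_html, 'html.parser')
soup_links = [a['href'] for a in soup.find_all('a', href=True)]

print(regex_links)
print(soup_links)
```

Here the regex misses the single-quoted link, while find_all picks up both, which is one reason the parser tends to be the more robust choice for this kind of task.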