I'm looking to download the files behind the href links on a page using beautifulsoup4, Python 3, and the requests library.
This is the code I have now. I thought it would be tough to use regex in this situation, but I'm not sure whether this can be done with beautifulsoup4. I have to download all of the shape files from the grid and want to automate this task. Thank you!
URL: https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads
import requests
from bs4 import BeautifulSoup
import re
URL = 'https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads'
page = requests.get(URL)
soup = BeautifulSoup(page.content,'html.parser')
results = re.findall(r'<a[^>]* href="([^"]*)"', page)
print(results)
CodePudding user response:
Those files are all associated with area tags, so I would simply select those:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads')
soup = bs(r.content, 'lxml')
files = ['https://earth-info.nga.mil' + i['href'] for i in soup.select('area')]
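Collecting the URLs is only half of the task in the question; the files still have to be saved. Below is a minimal sketch of one way to do that, not a tested recipe for this particular site. It assumes the href values on the area tags may be relative, so it resolves them with urljoin instead of string concatenation. The function names collect_links, download, and download_all, and the output directory shapefiles, are hypothetical names chosen for illustration.

```python
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

PAGE_URL = 'https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads'


def collect_links(html, base_url):
    """Return absolute URLs for every <area> tag that carries an href."""
    soup = BeautifulSoup(html, 'html.parser')
    return [urljoin(base_url, area['href']) for area in soup.select('area[href]')]


def download(url, out_dir='shapefiles'):
    """Stream one file into out_dir, named after the last URL path segment."""
    os.makedirs(out_dir, exist_ok=True)
    # Fall back to a generic name if the URL path has no final segment.
    filename = os.path.basename(urlparse(url).path) or 'download.bin'
    path = os.path.join(out_dir, filename)
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(path, 'wb') as fh:
            for chunk in response.iter_content(chunk_size=8192):
                fh.write(chunk)
    return path


def download_all(page_url=PAGE_URL):
    """Fetch the page, then download every linked file (network access needed)."""
    page = requests.get(page_url, timeout=60)
    page.raise_for_status()
    return [download(link) for link in collect_links(page.text, page_url)]
```

Calling download_all() would fetch the page and save each linked file. Streaming with iter_content keeps memory use flat for large archives, and raise_for_status stops the run on a failed request rather than writing an error page to disk.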
CodePudding user response:
re.findall needs a string, but page is a Response object. Use page.text to get the response body as a string, then search it for all a tags using regex.
Instead of:
results = re.findall(r'<a[^>]* href="([^"]*)"', page)
Use:
results = re.findall(r'<a[^>]* href="([^"]*)"', page.text)
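To see what this pattern returns once it is given a string, here is a small sketch that runs it on a made-up HTML snippet instead of the live page. The sample markup, the example.com base URL, and the helper name extract_links are all assumptions for illustration. Note that the pattern begins with `<a` followed by any characters other than `>`, so it also matches tags such as `<area ...>`, not only `<a ...>`.

```python
import re
from urllib.parse import urljoin

# Same pattern as above, compiled once for reuse.
HREF_PATTERN = re.compile(r'<a[^>]* href="([^"]*)"')


def extract_links(html, base_url):
    """Return absolute URLs for every href the pattern finds in html."""
    return [urljoin(base_url, href) for href in HREF_PATTERN.findall(html)]


# Hypothetical sample markup, not copied from the real page.
sample = (
    '<a class="nav" href="index.php">Home</a>'
    '<area shape="rect" href="files/cell.zip">'
)
links = extract_links(sample, 'https://example.com/maps/grid.html')
print(links)
```

With the real page, the same helper could be called as extract_links(page.text, URL). The pattern only recognises double-quoted href attributes preceded by a space, so links written with single quotes or other spacing would be missed.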