Home > Software engineering >  How to download website source files in python?
How to download website source files in python?

Time:06-04

Given a website (for example stackoverflow.com) I want to download all the files under:

(Right Click) -> Inspect -> Sources -> Page

Please Try it yourself and see the files you get.

How can I do that in python? I know how to retrive page source but not the source files.

I tried searching this multiple times with no success and there is a confusion between sources (files) and page source.

Please Note, I'm looking for a an approach or example rather than ready-to-use code.

For example, I want to gather all of these files under top:

enter image description here

CodePudding user response:

To download website source files (mirroring websites / copy source files from websites) you may try PyWebCopy library.

To save any single page -

from pywebcopy import save_webpage
save_webpage(
      url="https://httpbin.org/",
      project_folder="E://savedpages//",
      project_name="my_site",
      bypass_robots=True,
      debug=True,
      open_in_browser=True,
      delay=None,
      threaded=False,
)

To save full website -

from pywebcopy import save_website
save_website(
url="https://httpbin.org/",
project_folder="E://savedpages//",
project_name="my_site",
bypass_robots=True,
debug=True,
open_in_browser=True,
delay=None,
threaded=False,
)

You can also check tools like httrack which comes with a GUI to download website files (mirror).

On the other hand to download web-page source code (HTML pages) -

import requests

url = 'https://stackoverflow.com/questions/72462419/how-to-download-website-source-files-in-python'
html_output_name = 'test2.html'

req = requests.get(url, 'html.parser', headers={
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'})

with open(html_output_name, 'w') as f:
    f.write(req.text)
    f.close()

  • Related