Downloading a CSV file from a dynamic webpage in Python

Time:10-29

A CSV file is periodically uploaded to a known, constant URL (url_variable). I want to automatically download the latest iteration of that CSV file in the course of a Python script.

I have tried using Pandas, specifically pd.read_csv(url_variable), but I receive the "HTTP Error 403: Forbidden."

Next I tried urllib with spoofed headers (headers_variable), specifically urllib.request.Request(url_variable, headers=headers_variable). This works. However, when a new CSV file is uploaded to the URL and the script is run again, the old CSV file is returned.

How can I alter my code to download the new CSV file each time this block is called?

CodePudding user response:

Check whether the URL stays the same for new CSV uploads. If it does, simply downloading it again should work.
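
If the URL really is unchanged and you still get the old file, one thing worth ruling out is a cache somewhere between your script and the server. That is a guess; nothing in the question confirms it. Below is a minimal sketch with urllib that sends no-cache request headers and appends a throwaway query parameter. The helper name build_fresh_request and the "_" parameter are made up for illustration, and the sketch assumes the server ignores unknown query parameters:

```python
import time
import urllib.request
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_fresh_request(url, headers=None):
    """Build a Request that asks caches not to serve a stored copy.

    Hypothetical helper: whether this helps depends on where (and if)
    the stale copy is actually being cached.
    """
    # Append a unique query parameter so caches keyed on the full URL miss.
    # The name "_" is arbitrary; the server is assumed to ignore it.
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("_", str(time.time_ns())))
    fresh_url = urlunsplit(parts._replace(query=urlencode(query)))

    # Ask intermediaries to revalidate; caller-supplied headers take precedence.
    merged = {"Cache-Control": "no-cache", "Pragma": "no-cache"}
    merged.update(headers or {})
    return urllib.request.Request(fresh_url, headers=merged)
```

You would then pass the result to urllib.request.urlopen(...) the same way as your current Request object, for example build_fresh_request(url_variable, headers_variable).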

Here's an example of downloading a CSV file in memory and reading it directly using requests and pandas:

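from io import StringIO

import pandas as pd
import requests

if __name__ == "__main__":
    url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv'
    # Placeholder header; replace with whatever headers the server expects.
    headers = {"Authorization": "Test"}
    response = requests.get(url, headers=headers)
    # Raise on 4xx/5xx (e.g. 403) instead of parsing an error page as CSV.
    response.raise_for_status()
    # Parse the downloaded text in memory, without writing a file to disk.
    df = pd.read_csv(StringIO(response.text))
    print(df.shape)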

Of course, adjust the headers as needed. If the file is large, you could write it to a temporary file and process it from there; see the Python documentation for the tempfile module ("Generate temporary files and directories").
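
For the temporary-file route, here is a rough sketch using only the standard library. The helper name spool_chunks_to_tempfile is hypothetical, and the sample chunks stand in for what a streamed download would yield, so nothing is fetched here:

```python
import csv
import tempfile


def spool_chunks_to_tempfile(chunks):
    """Write an iterable of byte chunks to a temporary file, rewound for reading."""
    tmp = tempfile.TemporaryFile(mode="w+b")
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            tmp.write(chunk)
    tmp.seek(0)  # rewind so the caller reads from the start
    return tmp


# Stand-in data (made up); a real download would supply the chunks instead.
fake_chunks = [b"state,cases\n", b"Ohio,10\n", b"Utah,7\n"]
with spool_chunks_to_tempfile(fake_chunks) as tmp:
    # Decode line by line so the whole file never has to sit in memory.
    rows = list(csv.reader(line.decode("utf-8") for line in tmp))
print(rows)
```

With requests, the chunks could come from requests.get(url, headers=headers, stream=True).iter_content(chunk_size=65536), and the rewound file object can be handed to pd.read_csv in place of the StringIO wrapper.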
