A CSV file is periodically uploaded to a known, constant URL (url_variable). I want to automatically download the latest iteration of that CSV file in the course of a Python script.
I have tried using Pandas, specifically pd.read_csv(url_variable), but I receive the "HTTP Error 403: Forbidden."
Next I tried using urllib and passing in spoofed headers (headers_variable), specifically urllib.request.Request(url_variable, headers=headers_variable). This technique works. However, when a new CSV file is uploaded to the URL and the script is run again, the old CSV file is returned.
How can I alter my code to download the new CSV file each time this block is called?
CodePudding user response:
First check whether the URL really stays the same for new CSV uploads. If it does, simply downloading it again should return the new file.
Here's an example of downloading a CSV file in memory and reading it directly using requests and pandas:
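from io import StringIO

import pandas as pd
import requests

if __name__ == "__main__":
    url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv'
    headers = {"Authorization": "Test"}
    # Download the CSV into memory; raise on HTTP errors such as 403.
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    # Parse the response text directly, without writing it to disk.
    df = pd.read_csv(StringIO(response.text))
    print(df.shape)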
Of course, adjust the headers as needed. If the file is large, you could write it to a temporary file and process it from there instead of holding it in memory; see the Python documentation for the tempfile module, "Generate temporary files and directories".
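As a rough illustration of the temporary-file approach, here is a minimal sketch that uses only the standard library. The function name download_to_tempfile and its parameters are hypothetical, chosen for this example. The Cache-Control header mentioned in the comment is an assumption: it may help if the stale file comes from an intermediate cache, but whether the server or proxy honors it is not known from the question.

```python
import shutil
import tempfile
import urllib.request


def download_to_tempfile(url, headers=None, suffix=".csv"):
    """Stream the body at `url` into a temporary file and return its path.

    `headers` is an optional dict of request headers. Passing something like
    {"Cache-Control": "no-cache"} is an assumption that a cache sits between
    the script and the server; it is not guaranteed to change the result.
    """
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request) as response:
        # delete=False keeps the file on disk after the handle is closed,
        # so the caller is responsible for removing it when done.
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
            # Copy in chunks so the whole file is never held in memory.
            shutil.copyfileobj(response, tmp)
            return tmp.name
```

The returned path can then be passed to pd.read_csv(path), and the file removed with os.remove(path) once the DataFrame has been loaded.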