I don't know whether this has been asked before, but I could not find it on Stack Overflow or anywhere else on the web.
I'm working on a project that needs a live dataset so that we can update our database every day. I found a GitHub repository where a CSV file is updated every day, and I need to download it to my local machine whenever I run my code. How can I do that?
We are using Python and PostgreSQL.
CodePudding user response:
There is already an answer to your question; see the question linked below:
How to download a file from Github using Requests
If you want to download just the file, you need to use the Raw link at the top of the page, which will be (for your example):
https://raw.github.com/someguy/brilliant/master/somefile.txt
Note the change in domain name, and that the blob/ part of the path is gone.
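The rewrite described above (different host, no blob/ segment) can be sketched as a small helper. This is only an illustration: `to_raw_url` is a made-up name, and it targets raw.githubusercontent.com, the host used by the raw links elsewhere in this thread, rather than raw.github.com.

```python
from urllib.parse import urlsplit


def to_raw_url(blob_url: str) -> str:
    """Convert a github.com '/blob/' page URL into its raw.githubusercontent.com form."""
    parts = urlsplit(blob_url)
    if parts.netloc != "github.com":
        raise ValueError(f"not a github.com URL: {blob_url}")
    # Expected path layout: /<owner>/<repo>/blob/<branch>/<path to file>
    segments = parts.path.lstrip("/").split("/")
    if len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a blob URL: {blob_url}")
    # Keep owner, repo, branch and file path; drop only the 'blob' segment
    owner, repo, _, *rest = segments
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])


print(to_raw_url("https://github.com/someguy/brilliant/blob/master/somefile.txt"))
```

Clicking the Raw button on the file's page gives you the same link, so the helper is only useful when you are starting from a list of blob URLs.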
Try this code to download the COVID-2019 - ECDC (2020).csv file:
>>> import requests
>>> # The regular "blob" link returns GitHub's HTML page, not the file itself
>>> r = requests.get('https://github.com/owid/covid-19-data/blob/master/public/data/ecdc/COVID-2019 - ECDC (2020).csv')
>>> r.headers['Content-Type']
'text/html; charset=utf-8'
>>> # The raw link returns the file contents
>>> r = requests.get('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/ecdc/COVID-2019 - ECDC (2020).csv')
>>> r.headers['Content-Type']
'text/plain; charset=utf-8'
CodePudding user response:
You can automate this with the requests library by downloading the CSV file from its raw GitHub link. As long as that is the file being updated in the repository, each run will fetch the most recent data.
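import requests
url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv'
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop here if the download failed
# Write the raw bytes; the with-block closes the file afterwards
with open('data.csv', 'wb') as f:
    f.write(r.content)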
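For a job that runs every day, it may be worth keeping yesterday's file if today's download fails. Below is a minimal sketch of that idea, not a definitive implementation. It uses `urllib.request` from the standard library instead of requests so that it has no dependencies. The names `fetch_bytes` and `download_csv`, and the `.part` suffix, are assumptions made for illustration.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Return the body of `url`; urlopen raises on HTTP error statuses."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_csv(url: str, dest: str, fetch=fetch_bytes) -> Path:
    """Download `url` to `dest`, replacing the old file only on success."""
    content = fetch(url)
    if not content.strip():
        raise ValueError(f"empty response from {url}")
    dest_path = Path(dest)
    tmp_path = dest_path.with_name(dest_path.name + ".part")
    tmp_path.write_bytes(content)  # write to a temporary file first
    tmp_path.replace(dest_path)    # then swap it in, so a failed run keeps the old data
    return dest_path
```

Usage would look like `download_csv(url, 'data.csv')` with the raw URL from the answer above. The `fetch` parameter is only there so that the function can be tried without network access.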
CodePudding user response:
Find the raw URL of the CSV you want:
csv_url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/latest/owid-covid-latest.csv'
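Then read it in Python with pandas:
import pandas as pd
# error_bad_lines was removed in pandas 2.0; on_bad_lines (pandas >= 1.3) replaces it
df = pd.read_csv(csv_url, on_bad_lines='skip')
print(df)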
You can now update your SQL database with:
# my_engine must be a SQLAlchemy engine connected to your PostgreSQL database
df.to_sql('my_SQL_table', con=my_engine, if_exists='replace')
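The `my_engine` object is not defined in the snippet above. One possible way to get it, assuming SQLAlchemy with the psycopg2 driver, is to build the connection URL from environment variables. The helper name `postgres_url_from_env` is hypothetical, and reading the `PG*` variables (the names libpq also uses) is a choice made for this sketch, not something the answer prescribes.

```python
import os
from urllib.parse import quote_plus


def postgres_url_from_env(env=os.environ) -> str:
    """Build a SQLAlchemy-style PostgreSQL URL from PG* environment variables."""
    user = quote_plus(env["PGUSER"])
    password = quote_plus(env["PGPASSWORD"])  # escape characters such as '@' or '/'
    host = env.get("PGHOST", "localhost")     # assumed default
    port = env.get("PGPORT", "5432")          # PostgreSQL's default port
    database = env["PGDATABASE"]
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"


# Hypothetical usage (requires SQLAlchemy and psycopg2 to be installed):
# from sqlalchemy import create_engine
# my_engine = create_engine(postgres_url_from_env())
```

Keeping the credentials in the environment rather than in the script means the same code can run on a laptop and in a scheduled job without edits.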