I want to build a df holding a historical dataset by scraping a website, but I am struggling to accumulate the full period inside the loop. I can download a single day, but when I loop over several dates to store each iteration, the data does not accumulate in the dataframe.
The df I want to create, covering start_date to end_date, is as follows:
Fecha | PeríodeTU | TM°C | HRM% |
---|---|---|---|
single_date |
Here Fecha is a column I add from the single_date in the code below, and the remaining columns are the actual data scraped from the website.
I have tried this:
from datetime import date, timedelta

import pandas as pd
import requests
from bs4 import BeautifulSoup

def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

start_date = date(2020, 6, 1)
end_date = date(2021, 3, 3)
for single_date in daterange(start_date, end_date):
    # Meteo.cat API URL with the date
    url = "https://www.meteo.cat/observacions/xema/dades?codi=V3&dia=" + str(single_date) + "T00:00Z"
    # GET request to the API
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find_all('table')[2]
    df_table = pd.read_html(str(table))[0]
    df_table['Fecha'] = single_date

data['Fecha'] = df['Fecha']
data['Hora'] = df['PeríodeTU']
data['Temperatura_Media'] = df['TM°C']
data['Humedad_Relativa'] = df['HRM%']
data.to_csv('Data/tempset.csv', header=True, index=False)
df_table only holds the last date, but I want to save the full period I iterated over.
Does anyone know how to deal with this situation?
CodePudding user response:
You can collect the daily dataframes in a list and then concatenate them:
dfs = []
for single_date in daterange(start_date, end_date):
    # Meteo.cat API URL with the date
    url = "https://www.meteo.cat/observacions/xema/dades?codi=V3&dia=" + str(single_date) + "T00:00Z"
    # GET request to the API
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find_all('table')[2]
    # Keep each day's table, tagged with its date, instead of overwriting it
    dfs.append(pd.read_html(str(table))[0].assign(Fecha=single_date))
And finally after running the loop:
df_table = pd.concat(dfs)
This will create df_table with all the individual observations from the dataframes based on your loop.
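To try the pattern without hitting the website, here is a minimal sketch. Note that `fetch_day` is a hypothetical stand-in I made up for the scraping step (it just returns two fake rows per day), and the three-day range is only for illustration:

```python
from datetime import date, timedelta

import pandas as pd


def daterange(start_date, end_date):
    """Yield each date from start_date up to, but not including, end_date."""
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(n)


def fetch_day(single_date):
    """Hypothetical stand-in for the scraping step: returns two fake rows."""
    return pd.DataFrame({
        "PeríodeTU": ["00:00 - 00:30", "00:30 - 01:00"],
        "TM°C": [15.0, 14.5],
        "HRM%": [80, 82],
    })


dfs = []
for single_date in daterange(date(2020, 6, 1), date(2020, 6, 4)):
    # One dataframe per day, each tagged with the date it belongs to
    dfs.append(fetch_day(single_date).assign(Fecha=single_date))

# A single concat after the loop; ignore_index gives a clean 0..n-1 index
df_table = pd.concat(dfs, ignore_index=True)
print(df_table.shape)  # (6, 4): 3 days x 2 rows, 3 data columns + Fecha
```

Appending to a list and concatenating once avoids copying the growing dataframe on every iteration, which starts to matter over a range of several months.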
CodePudding user response:
If I understood correctly, you are missing a concat line:
from datetime import date, timedelta

import pandas as pd
import requests
from bs4 import BeautifulSoup

def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

start_date = date(2020, 6, 1)
end_date = date(2021, 3, 3)
df_agg = pd.DataFrame()  # define an empty df that will aggregate the inputs
for single_date in daterange(start_date, end_date):
    # Meteo.cat API URL with the date
    url = "https://www.meteo.cat/observacions/xema/dades?codi=V3&dia=" + str(single_date) + "T00:00Z"
    # GET request to the API
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find_all('table')[2]
    df_table = pd.read_html(str(table))[0]
    df_table['Fecha'] = single_date
    df_agg = pd.concat([df_agg, df_table], ignore_index=True)  # appending!

# Build the output from the aggregated df once the loop has finished
data = pd.DataFrame()
data['Fecha'] = df_agg['Fecha']
data['Hora'] = df_agg['PeríodeTU']
data['Temperatura_Media'] = df_agg['TM°C']
data['Humedad_Relativa'] = df_agg['HRM%']
data.to_csv('Data/tempset.csv', header=True, index=False)
Let me know if it worked for you
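As a side note, the column renaming can also be done in one step on the aggregated dataframe. This is only a sketch: the `df_agg` below is made-up sample data with the column names from the question, and I write the CSV to an in-memory buffer instead of 'Data/tempset.csv' so it runs anywhere:

```python
import io

import pandas as pd

# Made-up sample standing in for the aggregated scrape results
df_agg = pd.DataFrame({
    "PeríodeTU": ["00:00 - 00:30", "00:30 - 01:00"],
    "TM°C": [15.0, 14.5],
    "HRM%": [80, 82],
    "Fecha": ["2020-06-01", "2020-06-01"],
})

# Map the scraped column names to the output names
column_names = {
    "PeríodeTU": "Hora",
    "TM°C": "Temperatura_Media",
    "HRM%": "Humedad_Relativa",
}
output_columns = ["Fecha", "Hora", "Temperatura_Media", "Humedad_Relativa"]

# Rename, then select the columns in the desired order
data = df_agg.rename(columns=column_names)[output_columns]

# Write once, after all the data has been collected
buffer = io.StringIO()
data.to_csv(buffer, header=True, index=False)
print(buffer.getvalue())
```

Swapping `buffer` for the file path gives the same result on disk, written a single time rather than on every iteration.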