How to accumulate parsed data in a DataFrame through a loop with pandas when web scraping?

I want to build a DataFrame holding a historical dataset by scraping a website, but I am struggling to accumulate the full period inside the loop. I can download a single day, but when I loop over a range of dates to store each iteration, the data does not accumulate in the DataFrame.

The df I want to create from the start_date to the end_date is as follows:

Fecha	PeríodeTU	TM°C	HRM%
single_date

Here Fecha is a column added from the single_date in the code below, and the remaining columns are the actual data scraped from the website.

I have tried this:

def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

start_date = date(2020, 6, 1)
end_date = date(2021, 3, 3)


for single_date in daterange(start_date, end_date):
    #URL API Meteo.cat con la fecha
    url = "https://www.meteo.cat/observacions/xema/dades?codi=V3&dia=" + str(single_date) + "T00:00Z"

    # GET a la API
    res = requests.get(url)
    soup = BeautifulSoup(res.content,'lxml')
    table = soup.find_all('table')[2]
    df_table = pd.read_html(str(table))[0]
    df_table['Fecha'] = single_date


data['Fecha'] = df['Fecha']
data['Hora'] = df['PeríodeTU']
data['Temperatura_Media'] = df['TM°C']
data['Humedad_Relativa'] = df['HRM%']
data.to_csv('Data/tempset.csv', header=True, index=False)

df_table only holds the last date, but I want to keep the full period I iterated over.

Does anyone know how to deal with this situation?

CodePudding user response：

You can collect the daily dataframes in a list and then concatenate them:

dfs = []
for single_date in daterange(start_date, end_date):
    #URL API Meteo.cat con la fecha
    url = "https://www.meteo.cat/observacions/xema/dades?codi=V3&dia=" + str(single_date) + "T00:00Z"

    # GET a la API
    res = requests.get(url)
    soup = BeautifulSoup(res.content,'lxml')
    table = soup.find_all('table')[2]
    dfs.append(pd.read_html(str(table))[0].assign(Fecha = single_date))

And finally after running the loop:

df_table = pd.concat(dfs)

This will create df_table with all the individual observations from the dataframes based on your loop.
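To try the pattern without hitting the website, here is a minimal offline sketch. Note that fake_daily_table is a hypothetical stand-in I made up for the request and read_html step, and its values are invented; only the list plus pd.concat part reflects the approach above.

```python
from datetime import date, timedelta

import pandas as pd


def daterange(start_date, end_date):
    """Yield each date from start_date up to, but not including, end_date."""
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(n)


def fake_daily_table(single_date):
    """Hypothetical stand-in for the scraped table: two made-up rows per day."""
    return pd.DataFrame({
        "PeríodeTU": ["00:00 - 00:30", "00:30 - 01:00"],
        "TM°C": [15.0, 14.5],
        "HRM%": [80, 82],
    })


dfs = []
for single_date in daterange(date(2020, 6, 1), date(2020, 6, 4)):
    # One dataframe per day, tagged with its date, kept in a plain list
    dfs.append(fake_daily_table(single_date).assign(Fecha=single_date))

# A single concat at the end; ignore_index=True renumbers the rows 0..n-1
df_table = pd.concat(dfs, ignore_index=True)
print(df_table)
```

Passing ignore_index=True is optional; without it each day keeps its own 0, 1, ... row labels, so the index of the combined frame contains duplicates.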

CodePudding user response：

If I understood correctly, you are missing a concat line:

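from datetime import date, timedelta

import pandas as pd
import requests
from bs4 import BeautifulSoup

def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)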

start_date = date(2020, 6, 1)
end_date = date(2021, 3, 3)
df_agg = pd.DataFrame() # define an empty df that will aggregate the inputs

for single_date in daterange(start_date, end_date):
    #URL API Meteo.cat con la fecha
    url = "https://www.meteo.cat/observacions/xema/dades?codi=V3&dia=" + str(single_date) + "T00:00Z"

    # GET a la API
    res = requests.get(url)
    soup = BeautifulSoup(res.content,'lxml')
    table = soup.find_all('table')[2]
    df_table = pd.read_html(str(table))[0]
    df_table['Fecha'] = single_date
    df_agg = pd.concat([df_agg, df_table]) # appending!


# Select the columns of interest from the aggregated frame and rename them
data = df_agg[['Fecha', 'PeríodeTU', 'TM°C', 'HRM%']].rename(columns={
    'PeríodeTU': 'Hora',
    'TM°C': 'Temperatura_Media',
    'HRM%': 'Humedad_Relativa',
})
data.to_csv('Data/tempset.csv', header=True, index=False)
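As a quick sanity check, growing the frame with pd.concat inside the loop and concatenating a list once at the end should give the same rows. The sketch below uses small made-up frames (daily_tables is hypothetical, not scraped data) to compare the two. Keep in mind that each pd.concat call copies everything accumulated so far, so for long date ranges the single concat at the end is usually the cheaper option.

```python
import pandas as pd

# Hypothetical per-day tables standing in for the scraped ones
daily_tables = [
    pd.DataFrame({"TM°C": [15.0 + day, 14.5 + day], "Fecha": f"2020-06-0{day + 1}"})
    for day in range(3)
]

# Approach 1: grow an aggregate frame inside the loop
df_agg = pd.DataFrame()
for df_table in daily_tables:
    df_agg = pd.concat([df_agg, df_table], ignore_index=True)

# Approach 2: collect the frames in a list and concatenate once
df_once = pd.concat(daily_tables, ignore_index=True)

# Compare the cell values column by column
same_rows = df_agg.to_dict("list") == df_once.to_dict("list")
print(same_rows)
```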

Let me know if it worked for you