I'm trying to write a short Python snippet that loops through different webpages structured in the same way (i.e. the same number of columns/rows), loads all the information into a pandas DataFrame, and finally exports it to Excel.
I managed to write all the code that gathers what should be the column headers (in the `dt` HTML tags) and the rows (in the `dd` HTML tags), but I'm having trouble placing all this info into a pandas DataFrame.
for row in rows:
    QA_link = row.find('td', class_='views-field views-field-nothing-1').find('a', href=True)['href']
    req_QA = requests.get(QA_link)
    soup_QA = BeautifulSoup(req_QA.text, 'html.parser')
    QA_table = soup_QA.find('dl', class_='dl-horizontal SingleRulebookRecord')
    if boolInitialiseTable:
        QA_hdr = [str.replace(link.string, ':', '') for link in QA_table.findAll('dt')]
        QA_details = [str(link.string) for link in QA_table.findAll('dd')]
        df = pd.DataFrame()
        df = pd.concat([df, pd.DataFrame(QA_details).transpose()], ignore_index=True, axis=0)
        boolInitialiseTable = False
        df.columns = QA_hdr
    else:
        QA_details = [str(link.string) for link in QA_table.findAll('dd')]
        df = pd.concat([df, pd.DataFrame(QA_details).transpose()])
Here `rows` contains all the different web pages that need to be accessed to gather the info I want to put in the pandas DataFrame.
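(For reference, `rows` is built earlier in the script, roughly like below; the listing URL is just a placeholder, not the real one:)

# illustrative only: rows holds the <tr> elements of a listing page
# scraped earlier; listing_url is a placeholder
req = requests.get(listing_url)
soup = BeautifulSoup(req.text, 'html.parser')
rows = soup.find('table').find_all('tr')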
So from table-like HTML content such as:

<dl>
  <dt>Question ID:</dt>
  <dd>2020_5469 </dd>
  <dt>Topic:</dt>
  <dd>Weather</dd>
  <dt>Date</dt>
  <dd>06/06/2020</dd>
</dl>
I would like to get a pandas DataFrame with:

Question ID | Topic | Date
---|---|---
2020_5469 | Weather | 06/06/2020
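In code terms, the target is roughly this (a hand-written sketch of the desired shape):

# the desired result, written out by hand
import pandas as pd

df = pd.DataFrame([['2020_5469', 'Weather', '06/06/2020']],
                  columns=['Question ID', 'Topic', 'Date'])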
Finally, `df.to_excel('results.xlsx')` should do the job of exporting everything into Excel.
I feel that all this transposing in the code is not the correct way of doing it. In addition, the type of the fields of the table is object and not string as I would expect - but maybe this is not a problem.
CodePudding user response:
I would do it like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
links = ['https://www.eba.europa.eu/single-rule-book-qa/qna/view/publicId/2020_5469',
'https://www.eba.europa.eu/single-rule-book-qa/qna/view/publicId/2020_5128']
dfs = []
for QA_link in links:
    req_QA = requests.get(QA_link)
    soup_QA = BeautifulSoup(req_QA.text, 'html.parser')
    # headers come from the <dt> tags, the row values from the <dd> tags
    QA_hdr = [link.get_text() for link in soup_QA.findAll('dt')]
    QA_details = [[link.get_text() for link in soup_QA.findAll('dd')]]
    dfs.append(pd.DataFrame(QA_details, columns=QA_hdr))

df_all = pd.concat(dfs, axis=0).reset_index(drop=True)
# check for NaN values (columns not shared between urls)
print(df_all[df_all.columns[df_all.isna().any()]].T)
                                                                 0    1
Name of institution / submitter:       BearingPoint Switzerland AG  NaN
Country of incorporation / residence:                  Switzerland  NaN
Answer prepared by:                    Answer prepared by the EBA.  NaN
Subparagraph:                                                  NaN  (f)
df_all.iloc[:,:5].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Question ID:  2 non-null      object
 1   Legal Act:    2 non-null      object
 2   Topic:        2 non-null      object
 3   Article:      2 non-null      object
 4   Paragraph:    2 non-null      object
dtypes: object(5)
memory usage: 208.0 bytes
Notice that `QA_details` is a nested list: each inner list would fill a new row of the DataFrame; it's just that you only have one per page. E.g. here's how it works if you have two nested lists:
lst = [[1,2],[3,4]]
df = pd.DataFrame(lst, columns=['A','B'])
print(df)
   A  B
0  1  2
1  3  4
As for the reason why the `Dtype` is given as `object`, see e.g. this SO post. But all your cells will in fact contain strings, which we can easily check. E.g.:
cols = df_all.columns[df_all.notna().all()]
print(all([isinstance(i, str) for i in df_all.loc[0, cols]]))
# True
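If you would rather see an explicit string dtype than `object`, one option (a minimal sketch using pandas' `convert_dtypes`) is:

# optional: let pandas infer its dedicated extension dtypes;
# string-valued object columns are converted to the 'string' dtype
df_str = df_all.convert_dtypes()
print(df_str.dtypes.head())  # these columns should now report a string dtype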
Finally, yes, `df.to_excel('results.xlsx')` will work to export the `df` to Excel. Perhaps use `df.to_excel('results.xlsx', index=False)` to avoid also exporting the index.
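Applied to this script that would be the following (note that writing .xlsx files requires an Excel engine such as openpyxl to be installed):

# export the combined frame; needs an engine like openpyxl for .xlsx
df_all.to_excel('results.xlsx', index=False)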