I'm trying to write a short Python snippet that loops through different webpages structured in the same way (i.e. the same number of columns/rows), loads all the information into a pandas DataFrame, and finally exports it to Excel.
I managed to write all the code that gathers what should be the column headers (in the `dt` HTML tags) and the rows (in the `dd` HTML tags), but I'm having trouble placing all this info into a pandas DataFrame.
for row in rows:
    QA_link = row.find('td', class_='views-field views-field-nothing-1').find('a', href=True)['href']
    req_QA = requests.get(QA_link)
    soup_QA = BeautifulSoup(req_QA.text, 'html.parser')
    QA_table = soup_QA.find('dl', class_='dl-horizontal SingleRulebookRecord')
    if boolInitialiseTable:
        QA_hdr = [str.replace(link.string, ':', '') for link in QA_table.findAll('dt')]
        QA_details = [str(link.string) for link in QA_table.findAll('dd')]
        df = pd.DataFrame()
        df = pd.concat([df, pd.DataFrame(QA_details).transpose()], ignore_index=True, axis=0)
        boolInitialiseTable = False
        df.columns = QA_hdr
    else:
        QA_details = [str(link.string) for link in QA_table.findAll('dd')]
        df = pd.concat([df, pd.DataFrame(QA_details).transpose()])
Here `rows` contains all the different web pages that need to be accessed to gather the info I want to put in the pandas DataFrame.
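(For reference, `rows` is built earlier in the script, roughly like below; the listing URL is just a placeholder, not the real one:)

# illustrative only: rows holds the <tr> elements of a listing page
# scraped earlier; listing_url is a placeholder
req = requests.get(listing_url)
soup = BeautifulSoup(req.text, 'html.parser')
rows = soup.find('table').find_all('tr')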
So from table-like HTML content such as:

<dl>
  <dt>Question ID:</dt>
  <dd>2020_5469 </dd>
  <dt>Topic:</dt>
  <dd>Weather</dd>
  <dt>Date</dt>
  <dd>06/06/2020</dd>
</dl>
I would like to get a pandas DataFrame with:

Question ID | Topic | Date
---|---|---
2020_5469 | Weather | 06/06/2020
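In code terms, the target is roughly this (a hand-written sketch of the desired shape):

# the desired result, written out by hand
import pandas as pd

df = pd.DataFrame([['2020_5469', 'Weather', '06/06/2020']],
                  columns=['Question ID', 'Topic', 'Date'])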
Finally, `df.to_excel('results.xlsx')` should do the job of exporting everything into Excel.
I feel that all this transposing in the code is not the correct way of doing it. In addition, the type of the fields of the table is object and not string as I would expect - but maybe this is not a problem.
CodePudding user response:
I would do it like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
links = ['https://www.eba.europa.eu/single-rule-book-qa/qna/view/publicId/2020_5469',
'https://www.eba.europa.eu/single-rule-book-qa/qna/view/publicId/2020_5128']
dfs = []
for QA_link in links:
    req_QA = requests.get(QA_link)
    soup_QA = BeautifulSoup(req_QA.text, 'html.parser')
    # headers come from the <dt> tags, the row values from the <dd> tags
    QA_hdr = [link.get_text() for link in soup_QA.findAll('dt')]
    QA_details = [[link.get_text() for link in soup_QA.findAll('dd')]]
    dfs.append(pd.DataFrame(QA_details, columns=QA_hdr))

df_all = pd.concat(dfs, axis=0).reset_index(drop=True)
# check for NaN values (columns not shared between urls)
print(df_all[df_all.columns[df_all.isna().any()]].T)
                                                                 0    1
Name of institution / submitter:       BearingPoint Switzerland AG  NaN
Country of incorporation / residence:                  Switzerland  NaN
Answer prepared by:                    Answer prepared by the EBA.  NaN
Subparagraph:                                                  NaN  (f)
df_all.iloc[:,:5].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Question ID:  2 non-null      object
 1   Legal Act:    2 non-null      object
 2   Topic:        2 non-null      object
 3   Article:      2 non-null      object
 4   Paragraph:    2 non-null      object
dtypes: object(5)
memory usage: 208.0 bytes
Notice that `QA_details` is a nested list: each inner list would fill a new row of the DataFrame; it's just that you only have one per page. E.g. here's how it works if you have two nested lists:
lst = [[1,2],[3,4]]
df = pd.DataFrame(lst, columns=['A','B'])
print(df)
   A  B
0  1  2
1  3  4
As for the reason why the `Dtype` is given as `object`, see e.g. this SO post. But all your cells will in fact contain strings, which we can easily check. E.g.:
cols = df_all.columns[df_all.notna().all()]
print(all([isinstance(i, str) for i in df_all.loc[0, cols]]))
# True
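If you would rather see an explicit string dtype than `object`, one option (a minimal sketch using pandas' `convert_dtypes`) is:

# optional: let pandas infer its dedicated extension dtypes;
# string-valued object columns are converted to the 'string' dtype
df_str = df_all.convert_dtypes()
print(df_str.dtypes.head())  # these columns should now report a string dtype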
Finally, yes, `df.to_excel('results.xlsx')` will work to export the `df` to Excel. Perhaps use `df.to_excel('results.xlsx', index=False)` to avoid also exporting the index.
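Applied to this script that would be the following (note that writing .xlsx files requires an Excel engine such as openpyxl to be installed):

# export the combined frame; needs an engine like openpyxl for .xlsx
df_all.to_excel('results.xlsx', index=False)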