I scraped this webpage with the code below.
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.ecb.europa.eu/press/pressconf/2022/html/index_include.en.html"
soup = BeautifulSoup(requests.get(url).content)
data = []
u = soup.select('div.title > a')
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
data.append({'text': soup.select('main .section p:not([class])')})
print(data)
df = pd.DataFrame(data)
# results (the exact text may differ)
[... <p><strong>Duisenberg:</strong> My answer is, well, in the first place when something is before the courts you do not comment. I don't comment and particularly not when it concerns such an esteemed colleague of mine. So, on the hypothetical question whether other people would be eligible for the job, I think it is wise not to go into that either. </p>]
The problem is that when I turn data into a dataframe, it stays in a list format, which is difficult to handle. I would like it saved as a single object without losing its markup (</p>, </strong>).
If I do the following instead, it loses the division into paragraphs and the bold tags that I will need for manipulation.
data = []
u = soup.select('div.title > a')
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
data.append({
'text':' '.join([p.text for p in soup.select('main .section p:not([class])')])
})
df = pd.DataFrame(data)
# with this however I lose the breakdown in paragraphs, bold characters etc. I'd like to keep them in the text.
Can anyone help me with this?
Thanks!
CodePudding user response:
Not sure if I understand it correctly, but if you would like to convert the ResultSet to text you can do it like this:
''.join([str(e) for e in soup.select('main .section p:not([class])')])
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.ecb.europa.eu/press/pressconf/2022/html/index_include.en.html"
soup = BeautifulSoup(requests.get(url).content)
data = []
u = soup.select('div.title > a')
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
data.append({'text':''.join([str(e) for e in soup.select('main .section p:not([class])')])})
pd.DataFrame(data)
Output
text
<p>Good afternoon, the Vice-President and I welcome you to our press conference. </p><p id="_Hlk93669934">The euro area economy is continuing to r...
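Because the joined string keeps the raw HTML tags, you can re-parse it later whenever you need the paragraph breaks or bold runs back. A minimal, offline sketch with a hypothetical HTML snippet (no network request needed):

```python
from bs4 import BeautifulSoup

# Hypothetical stored value, as it would look after ''.join([str(e) for e in ...])
html = ('<p><strong>Duisenberg:</strong> My answer is, well ...</p>'
        '<p>Second paragraph without bold text.</p>')

soup = BeautifulSoup(html, 'html.parser')
paragraphs = soup.find_all('p')       # paragraph structure survives
bold = soup.find_all('strong')        # bold markers survive too

print(len(paragraphs))   # 2
print(bold[0].text)      # Duisenberg:
```

So storing `str(e)` for each element in the DataFrame cell preserves everything the `.text` version would have thrown away.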