How to turn list into dataframe without losing text format in Python?

Time:02-13

I webscraped this webpage.


from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.ecb.europa.eu/press/pressconf/2022/html/index_include.en.html"
soup = BeautifulSoup(requests.get(url).content)

data = []

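# collect the links to the individual press conferences
u = soup.select('div.title > a')
# follow the first link only
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
# append the list of <p> tags as-is (a set literal here would fail: the result is unhashable)
data.append(soup.select('main .section p:not([class])'))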

print(data)

df = pd.DataFrame(data)

# results (it may not be the same text)
[... <p><strong>Duisenberg:</strong> My answer is, well, in the first place when something is before the courts you do not comment. I don't comment and particularly not when it concerns such an esteemed colleague of mine. So, on the hypothetical question whether other people would be eligible for the job, I think it is wise not to go into that either. </p>]

The problem is that when I turn data into a dataframe, it remains in a list format, which is difficult to handle. I would like the text to be saved as a single object without losing its markup (</p>, </strong>).

If I do this, it loses the division into paragraphs and the bold markup that I will need for later manipulation.

data = []

u = soup.select('div.title > a')
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
data.append({
    # .text strips the tags, so only the plain text survives
    'text':' '.join([p.text for p in soup.select('main .section p:not([class])')])
})

df = pd.DataFrame(data)

# with this however I lose the breakdown in paragraphs, bold characters etc. I'd like to keep them in the text.

Can anyone help me with this?

Thanks!

CodePudding user response:

Not sure if I understand correctly, but if you want to convert the ResultSet to text while keeping the tags, you can do it like this:

''.join([str(e) for e in soup.select('main .section p:not([class])')])
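
To see why str() matters here, this is a small offline sketch on a hard-coded snippet (the HTML string is made up for illustration and is not taken from the ECB page). It assumes the same selector as in the question: .text drops the tags, while str() keeps them.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the scraped page
html = (
    "<main><div class='section'>"
    "<p><strong>Duisenberg:</strong> My answer is no.</p>"
    "<p>Second paragraph.</p>"
    "</div></main>"
)
soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.select("main .section p:not([class])")

# .text returns only the text nodes, so <p> and <strong> are lost
plain = " ".join(p.text for p in paragraphs)

# str() serialises each tag back to HTML, so the markup is preserved
markup = "".join(str(p) for p in paragraphs)

print(plain)
print(markup)
```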

Example

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.ecb.europa.eu/press/pressconf/2022/html/index_include.en.html"
soup = BeautifulSoup(requests.get(url).content)

data = []

u = soup.select('div.title > a')
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
data.append({'text':''.join([str(e) for e in soup.select('main .section p:not([class])')])})

pd.DataFrame(data)

Output

text
<p>Good afternoon, the Vice-President and I welcome you to our press conference. </p><p id="_Hlk93669934">The euro area economy is continuing to r...
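
Since the cell now holds an HTML string, the paragraphs and bold parts can be recovered later by parsing the cell again. A minimal sketch, assuming a dataframe shaped like the one above (the cell content and the names cell, paragraphs and bold_parts are placeholders for illustration):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical cell content, shaped like the 'text' column above
df = pd.DataFrame([{
    "text": "<p><strong>Speaker A:</strong> Good afternoon.</p>"
            "<p>Second paragraph.</p>"
}])

# Re-parse the stored string to get the structure back
cell = BeautifulSoup(df.loc[0, "text"], "html.parser")

# One entry per <p>, tags stripped
paragraphs = [p.get_text() for p in cell.find_all("p")]

# Only the parts that were wrapped in <strong>
bold_parts = [s.get_text() for s in cell.find_all("strong")]

print(paragraphs)
print(bold_parts)
```

The "..." in the output above is only how pandas shortens long cells for display; the full string is still stored. Calling pd.set_option("display.max_colwidth", None) shows it in full.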