BeautifulSoup won't consider linebreaks while parsing a table

Time:09-17

I would like to parse a table using BS and ultimately put some of its content into a dataframe. For tables where there is only one line of text in each row, this works fine. There are, however, cases where some table cells contain multiple lines of text.

So...

Maybe:

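from io import StringIO

import pandas as pd
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"
}

url = 'https://www.sec.gov/Archives/edgar/data/0001000230/000119312507035211/ddef14a.htm'

page = requests.get(url, headers=headers)
page.raise_for_status()

# Replace <BR> tags with newlines so the lines of a multi-line cell stay separated.
# Recent pandas versions expect a file-like object rather than a raw HTML string.
tables = pd.read_html(StringIO(page.text.replace('<BR>', '\n')))

# read_html returns a list of DataFrames; index 50 is the table of interest on this page
df = tables[50]

# df.columns = pd.MultiIndex.from_arrays([df.iloc[1],df.iloc[2]])
# line above or the line below - pick one, comment the other out
df.columns = df.iloc[2]

# Drop the header rows now that they are used as column names
df = df.iloc[3:]
df = df.reset_index(drop=True)

# Drop columns that contain any NaN
df = df.dropna(axis=1)

# Drop columns that contain a bare '$' cell
df = df.loc[:, ~(df == '$').any()]

display(df)  # Jupyter/IPython; use print(df) in a plain script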
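
Note that the replace('<BR>', '\n') call only matches that exact uppercase spelling. If a page mixes forms such as <br>, <br/> or <br />, a case-insensitive regular expression might be more robust. The following is only a sketch using the standard library; the names BR_TAG and normalize_line_breaks are made up for this example and are not part of pandas or BeautifulSoup.

```python
import re

# Matches <br>, <BR>, <br/>, <br /> and <br ...attributes...>, in any letter case.
# The \b keeps it from matching other tags that merely start with "br".
BR_TAG = re.compile(r"<br\b[^>]*>", re.IGNORECASE)


def normalize_line_breaks(html: str, replacement: str = "\n") -> str:
    """Return the markup with every <br> variant replaced by `replacement`."""
    return BR_TAG.sub(replacement, html)


sample = "<td>First line<BR>second line<br/>third line<Br class='x'>end</td>"
print(normalize_line_breaks(sample))
```

The returned string could then be passed on to pd.read_html in place of the plain str.replace result.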


Exporting:

from styleframe import StyleFrame
StyleFrame(df).to_excel('out.xlsx').save()


The squares in the exported file appear to be \x97 characters in the data, which you can remove when cleaning it up.
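
One way to do that cleanup, sketched under the assumption that the cells really contain the character U+0097 (byte 0x97 is an em dash in Windows-1252, which would explain it showing up as a square when the page is decoded differently). The helper name strip_x97 and the sample rows are invented for illustration.

```python
def strip_x97(value, replacement=""):
    """Replace U+0097 in string values; return non-string values unchanged."""
    if isinstance(value, str):
        return value.replace("\x97", replacement)
    return value


# Hypothetical rows standing in for the parsed table cells.
rows = [["Name", "\x97", 12.5], ["Total\x97", None, "3"]]
cleaned = [[strip_x97(cell) for cell in row] for row in rows]
print(cleaned)
```

On the dataframe itself, the same helper could be applied cell by cell, for example with df.applymap(strip_x97) (or df.map in recent pandas versions). Passing replacement="-" instead would keep a visible dash where the character was.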
