Convert HTML to Pandas DataFrame-CodePudding

I have a BeautifulSoup object as follows:

<div class="companyProfileHeader">
<div>Industry<a href="/stock-screener/?sp=country::5|sector::a|industry::146|equityType::a&lt;eq_market_cap;1">Life Sciences Tools &amp; Services</a></div>
<div>Sector<a href="/stock-screener/?sp=country::5|sector::18|industry::a|equityType::a&lt;eq_market_cap;1">Healthcare</a></div>
<div>Employees<p >17000</p></div>
<div>Equity Type<p >ORD</p></div>
</div>

I want to convert the above into a pandas DataFrame as follows:

Expected Output

 -------------------------------- ------------ ----------- ------------- 
|            Industry            |   Sector   | Employees | Equity Type |
 -------------------------------- ------------ ----------- ------------- 
| Life Sciences Tools & Services | Healthcare |     17000 | ORD         |
 -------------------------------- ------------ ----------- -------------

Suppose the BeautifulSoup object is named divlist. I've extracted its text with divlist.text, but I can't slice the result appropriately to build the DataFrame above.

CodePudding user response：

I treated your data as an HTML string. Use find to locate the outer div by its class, then find_all to collect its child divs. A list comprehension then extracts each div's text, with the label and value separated by a ~ symbol:

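from bs4 import BeautifulSoup

# html holds the markup from the question as a string
soup = BeautifulSoup(html, "html.parser")
# Each inner div becomes "label~value"
lst = [i.get_text(strip=True, separator="~") for i in soup.find("div", class_="companyProfileHeader").find_all("div")]
# Split each entry into [label, value]
final_lst = [i.split("~") for i in lst]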

Now you can build a DataFrame from final_lst:

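import pandas as pd

df = pd.DataFrame(final_lst)
# After transposing, row 0 holds the labels and row 1 the values
df = df.transpose()
# Promote the label row to column names, then drop it
df.rename(columns=df.iloc[0], inplace=True)
df.drop(df.index[0], inplace=True)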
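
For reference, the steps above can be run end to end. This is only a minimal sketch: it assumes the outer div carries the class companyProfileHeader that the selector above relies on, shortens the href values to "#" for readability, and builds the single row from a dict of label/value pairs instead of transposing (a swapped-in variation, not the approach posted above).

```python
from bs4 import BeautifulSoup
import pandas as pd

# Assumption: the outer div has class "companyProfileHeader";
# the href values are shortened placeholders.
html = """<div class="companyProfileHeader">
<div>Industry<a href="#">Life Sciences Tools &amp; Services</a></div>
<div>Sector<a href="#">Healthcare</a></div>
<div>Employees<p>17000</p></div>
<div>Equity Type<p>ORD</p></div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", class_="companyProfileHeader")

# Each inner div yields "label~value"; split once into a [label, value] pair
pairs = [
    d.get_text(strip=True, separator="~").split("~", 1)
    for d in outer.find_all("div")
]

# A dict of label -> value becomes a single-row DataFrame
df = pd.DataFrame([dict(pairs)])
print(df.to_string(index=False))
```

Building from a dict keeps the labels as column names directly, so no rename or drop step is needed and the row index starts at 0.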

CodePudding user response：

Here is how I did it.

from bs4 import BeautifulSoup
text = """<div class="companyProfileHeader">
<div>Industry<a href="/stock-screener/?sp=country::5|sector::a|industry::146|equityType::a&lt;eq_market_cap;1">Life Sciences Tools &amp; Services</a></div>
<div>Sector<a href="/stock-screener/?sp=country::5|sector::18|industry::a|equityType::a&lt;eq_market_cap;1">Healthcare</a></div>
<div>Employees<p >17000</p></div>
<div>Equity Type<p >ORD</p></div>
</div>
"""
soup = BeautifulSoup(text, 'html.parser')

The following would then extract the column and row respectively:

# Select only the inner divs; .next_element replaces the deprecated .next alias
cells = soup.find('div').find_all('div')
column = [str(i.next_element) for i in cells]
row = [i.next_element.next_element.text for i in cells]

Next, create a dataframe.

import pandas as pd

# DataFrame.append was removed in pandas 2.0, so build the single row directly
df = pd.DataFrame([row], columns=column)
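
Every value extracted this way is a string, including the employee count. If Employees is needed as a number, one option is pandas.to_numeric. A minimal sketch, with the frame rebuilt from literal values so that the snippet runs on its own:

```python
import pandas as pd

# Literal stand-ins for the column and row lists extracted above
column = ["Industry", "Sector", "Employees", "Equity Type"]
row = ["Life Sciences Tools & Services", "Healthcare", "17000", "ORD"]
df = pd.DataFrame([row], columns=column)

# Scraped values arrive as strings; convert Employees to an integer column
df["Employees"] = pd.to_numeric(df["Employees"])
print(df.dtypes)
```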