Python BeautifulSoup - Create dataframe using html tags between <div>

I have an HTML page without any tables, and I want to scrape its data into a table. Here is a sample of the HTML:

    <div class='ah-content'>
      <h4>XYZ Community</h4>
      <p>123 Street</p>
      <p>Atlanta, Georgia, 12345</p>
      <p>1234567890</p>
    </div>

The page is a long list of blocks like this, and I want to capture the text of the <h4> and <p> tags inside each <div>.

The desired output is:

Name	Address	Address2	Phone
XYZ Community	123 Street	Atlanta, Georgia, 12345	1234567890

CodePudding user response:

If every <div class='ah-content'> follows the same pattern as in your example, you can use this script to create a DataFrame:

    import pandas as pd
    from bs4 import BeautifulSoup

    html_doc = """\
    <div class='ah-content'>
      <h4>XYZ Community</h4>
      <p>123 Street</p>
      <p>Atlanta, Georgia, 12345</p>
      <p>1234567890</p>
    </div>"""

    soup = BeautifulSoup(html_doc, "html.parser")

    # One row per .ah-content block: the text of its <h4> and <p> tags, in document order.
    strings = [
        [t.get_text(strip=True) for t in c.find_all(["h4", "p"])]
        for c in soup.select(".ah-content")
    ]

    df = pd.DataFrame(strings, columns=["Name", "Address", "Address2", "Phone"])

    # to_markdown() needs the optional tabulate package; plain print(df) works without it.
    print(df.to_markdown(index=False))

Prints a Markdown table, which renders as:

Name	Address	Address2	Phone
XYZ Community	123 Street	Atlanta, Georgia, 12345	1234567890
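
The script above relies on every block having one <h4> followed by exactly three <p> tags. If some blocks on the real page are shorter (for example, a missing phone line), a slightly more defensive variant could look like the sketch below. The second sample block, the `COLUMNS` constant, and the `parse_block` helper are made up for illustration and are not from the original question; the sketch also assumes that a missing value is always the last one.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample: the second block has no phone line.
html_doc = """
<div class='ah-content'>
  <h4>XYZ Community</h4>
  <p>123 Street</p>
  <p>Atlanta, Georgia, 12345</p>
  <p>1234567890</p>
</div>
<div class='ah-content'>
  <h4>ABC Community</h4>
  <p>456 Avenue</p>
  <p>Atlanta, Georgia, 12345</p>
</div>
"""

COLUMNS = ["Name", "Address", "Address2", "Phone"]


def parse_block(div):
    """Turn one .ah-content <div> into a dict keyed by column name."""
    heading = div.find("h4")
    name = heading.get_text(strip=True) if heading else None
    paragraphs = [p.get_text(strip=True) for p in div.find_all("p")]
    # Pad with None (or truncate) so every row has exactly three <p> values.
    paragraphs = (paragraphs + [None] * 3)[:3]
    return dict(zip(COLUMNS, [name, *paragraphs]))


soup = BeautifulSoup(html_doc, "html.parser")
rows = [parse_block(div) for div in soup.select("div.ah-content")]
df = pd.DataFrame(rows, columns=COLUMNS)
print(df)
```

Building each row as a dict keeps the column mapping explicit, so a short block ends up with an empty Phone cell instead of shifting values into the wrong column.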