Saving scraped data with no tables into pandas

Time:03-14

Hello everyone, I have a website whose data I need to save to an Excel sheet. The data isn't in a table, which is the format I would normally handle with pandas. Below is a sample section of the page, along with the code I use to pull the exact information I need.

from bs4 import BeautifulSoup


html_doc = """
<div >
    <p>
      <span >Order Number</span><br>
      A-21-897274
    </p>
</div>
<div >
  <p>
    <span >Location</span><br>
    Ohio
  </p>
</div>
  <div >
    <p>
      <span >Date</span><br>
      07/01/2022
    </p>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def correct_tag(tag):
    return tag.name == "span" and tag.get_text(strip=True) in {
        "Order Number",
        "Location",
        "Date",
    }


for t in soup.find_all(correct_tag):
    # string=True is the current name for the older text=True argument
    print(f"{t.text}: {t.find_next_sibling(string=True).strip()}")

This works perfectly and pulls the data I want, as shown below:

Order Number: A-21-897274
Location: Ohio
Date: 07/01/2022

I just need help getting this data into a DataFrame so I can save it to Excel. Any help would be appreciated!

CodePudding user response:

Store the data for each order as a dict. If you have many orders, append each dict to a list. Finally, convert the list to a DataFrame.

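import pandas as pd

order_list = []
order_info = {}  # one dict per order: label -> value

# soup and correct_tag come from the question's code above
for t in soup.find_all(correct_tag):
    order_info[t.text] = t.find_next_sibling(string=True).strip()

# assume you have many orders (append each order's dict to the list first)
order_list.append(order_info)

# each dict becomes one row; the keys become the column names
order_df = pd.DataFrame(order_list)
print(order_df.head())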

output:

  Order Number    Location    Date
0 A-21-897274     Ohio        07/01/2022
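
If the page lists several orders, each one needs its own dict, otherwise later values overwrite earlier ones. Below is a minimal sketch of that, plus the Excel step the question asks about. It assumes every order begins with an "Order Number" label, which the original post does not confirm. The helper names group_orders and save_orders, the file name orders.xlsx, and the second sample order are all made up for illustration.

```python
def group_orders(pairs, first_label="Order Number"):
    """Group (label, value) pairs into one dict per order.

    Assumption: a new order starts each time `first_label` appears.
    """
    orders = []
    for label, value in pairs:
        if label == first_label or not orders:
            orders.append({})  # fresh dict, so rows never share state
        orders[-1][label] = value
    return orders


def save_orders(orders, path="orders.xlsx"):
    """Write the orders to an Excel file (needs pandas and openpyxl)."""
    import pandas as pd  # imported here so group_orders works without pandas

    df = pd.DataFrame(orders)
    df.to_excel(path, index=False)  # index=False drops the 0, 1, ... column
    return df


# Hypothetical sample data, shaped like what the scraping loop produces.
pairs = [
    ("Order Number", "A-21-897274"), ("Location", "Ohio"), ("Date", "07/01/2022"),
    ("Order Number", "A-21-897275"), ("Location", "Texas"), ("Date", "07/02/2022"),
]
orders = group_orders(pairs)
```

With the soup from the question, the pairs could be built as [(t.text, t.find_next_sibling(string=True).strip()) for t in soup.find_all(correct_tag)], and save_orders(orders) would then write one row per order.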