Hello everyone, I need to save data from a website into an Excel sheet. The data isn't in a table format, which is what I would normally use pandas for. Below is a section of the site as an example, along with the code I used to pull the exact information I need.
from bs4 import BeautifulSoup
html_doc = """
<div >
<p>
<span >Order Number</span><br>
A-21-897274
</p>
</div>
<div >
<p>
<span >Location</span><br>
Ohio
</p>
</div>
<div >
<p>
<span >Date</span><br>
07/01/2022
</p>
</div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
def correct_tag(tag):
    return tag.name == "span" and tag.get_text(strip=True) in {
        "Order Number",
        "Location",
        "Date",
    }

for t in soup.find_all(correct_tag):
    print(f"{t.text}: {t.find_next_sibling(text=True).strip()}")
This works perfectly and pulls the data I want, as shown below:
Order Number: A-21-897274
Location: Ohio
Date: 07/01/2022
I just need help getting this data into a DataFrame so I can save it to Excel. Any help would be appreciated!
CodePudding user response:
Store the data as a dict. If you have many orders, append each dict to a list. Finally, convert the list to a DataFrame.
import pandas as pd

order_list = []  # one dict per order
order_info = {}  # label -> value for the current order
for t in soup.find_all(correct_tag):
    order_info[t.text] = t.find_next_sibling(text=True).strip()
# assume you have many orders (append each order's dict to the list first)
order_list.append(order_info)
order_df = pd.DataFrame(order_list)
order_df.head()
output:
  Order Number Location        Date
0  A-21-897274     Ohio  07/01/2022
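If there really are several orders, each one needs its own dict, otherwise later values overwrite earlier ones. Below is a minimal sketch of that, assuming each order comes from its own HTML snippet. The helper name parse_order, the pages list, and the second order's values are made up for illustration and are not from the original page. It uses string=True, which is the current spelling of the older text=True argument in Beautiful Soup.

```python
import pandas as pd
from bs4 import BeautifulSoup

LABELS = {"Order Number", "Location", "Date"}


def parse_order(html):
    """Return a {label: value} dict for one order's HTML (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    order = {}
    for span in soup.find_all("span"):
        label = span.get_text(strip=True)
        if label in LABELS:
            # The value is the first text node that follows the span (after the <br>).
            value = span.find_next_sibling(string=True)
            order[label] = value.strip() if value else None
    return order


# Assumption: one HTML snippet per order. The second order is invented sample data.
pages = [
    "<div><p><span>Order Number</span><br>\nA-21-897274\n</p></div>"
    "<div><p><span>Location</span><br>\nOhio\n</p></div>"
    "<div><p><span>Date</span><br>\n07/01/2022\n</p></div>",
    "<div><p><span>Order Number</span><br>\nB-00-000001\n</p></div>"
    "<div><p><span>Location</span><br>\nTexas\n</p></div>"
    "<div><p><span>Date</span><br>\n07/02/2022\n</p></div>",
]

# A fresh dict per order, collected into a list, then one row per dict.
order_df = pd.DataFrame([parse_order(page) for page in pages])

# To save for Excel (writing .xlsx needs an engine such as openpyxl installed):
# order_df.to_excel("orders.xlsx", index=False)
```

The file name orders.xlsx is only a placeholder. Passing index=False keeps the 0, 1, ... row index out of the spreadsheet.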