I have a local .html file from which I am extracting data. I parsed it with BeautifulSoup, but I don't know how to extract the date that is inside a div. The parsed markup looks like this:
<div ><div><div><a href="https://www.instagram.com/chuckbasspics" target="_blank">chuckbasspics</a></div><div>Jan 7, 2013, 5:41 AM</div></div></div><div ></div></div><div ><div ><div><div>
Any idea how to do it?
I already extracted the users and urls (href) with the following code:
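from bs4 import BeautifulSoup

users = []
url_follower = []
# Read the saved followers page; the with block closes the file afterwards
with open('followers.html', 'r') as fl_html:
    index = fl_html.read()
soup = BeautifulSoup(index, 'lxml')
# Every <a> that has an href attribute is treated as a follower link
usernames = soup.find_all('a', href=True)
for i in usernames:
    users.append(i.get_text(strip=True))
    url_follower.append(i['href'])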
CodePudding user response:
You can use either the bs4 API or a CSS selector. First, parse the snippet:
from bs4 import BeautifulSoup
html_doc = """<div ><div><div><a href="https://www.instagram.com/chuckbasspics" target="_blank">chuckbasspics</a></div><div>Jan 7, 2013, 5:41 AM</div></div></div><div ></div></div><div ><div ><div><div>"""
soup = BeautifulSoup(html_doc, "html.parser")
Extracting the date using .get_text() with separator=
You can get all the text from the HTML snippet with a custom separator, then .split it:
t = soup.get_text(strip=True, separator="|").split("|")
print(t[1])
Prints:
Jan 7, 2013, 5:41 AM
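For a file with many followers, the same split could be paired up into (username, date) tuples. This is only a sketch: the second follower (example_user and its date) is made up for illustration, and the approach assumes every entry yields exactly two text fragments and that no text contains the "|" separator.

```python
from bs4 import BeautifulSoup

# Hypothetical two-follower sample in the same shape as the question's snippet
html_doc = """
<div><div><a href="https://www.instagram.com/chuckbasspics">chuckbasspics</a></div><div>Jan 7, 2013, 5:41 AM</div></div>
<div><div><a href="https://www.instagram.com/example_user">example_user</a></div><div>Feb 3, 2014, 9:15 PM</div></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Fragments alternate: username, date, username, date, ...
parts = soup.get_text(strip=True, separator="|").split("|")

# Pair even-indexed fragments (usernames) with odd-indexed ones (dates)
pairs = list(zip(parts[0::2], parts[1::2]))
for user, date in pairs:
    print(user, "->", date)
```

If an entry is missing its date, the pairing shifts for every entry after it, so this is best used on input known to be regular.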
CSS selector
Find the next sibling of the <div> which contains <a>:
t = soup.select_one("div:has(a) + div")
print(t.text)
Prints:
Jan 7, 2013, 5:41 AM
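To collect every date rather than only the first, .select() could be used with an adjacent-sibling selector. The sketch below is an assumption about how the full file is laid out: it uses a made-up second follower (example_user) and the stricter :has(> a), which only matches a <div> whose direct child is the link.

```python
from bs4 import BeautifulSoup

# Hypothetical two-follower sample in the same shape as the question's snippet
html_doc = """
<div><div><a href="https://www.instagram.com/chuckbasspics">chuckbasspics</a></div><div>Jan 7, 2013, 5:41 AM</div></div>
<div><div><a href="https://www.instagram.com/example_user">example_user</a></div><div>Feb 3, 2014, 9:15 PM</div></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Each match is the <div> immediately after a <div> that directly wraps an <a>
date_divs = soup.select("div:has(> a) + div")
dates = [d.get_text(strip=True) for d in date_divs]

# Walk back from each date to its link so names and dates stay aligned
users = [d.find_previous_sibling("div").a.get_text(strip=True) for d in date_divs]

for user, date in zip(users, dates):
    print(user, "->", date)
```

Deriving the username from each date's own sibling keeps the two lists aligned even if some links in the file have no date next to them.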
Using the bs4 API
The time must contain AM or PM, so select the <div> which contains one of these strings:
t = soup.find("div", string=lambda s: s and (" AM" in s or " PM" in s))
print(t.text)
Prints:
Jan 7, 2013, 5:41 AM
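Once the date text is extracted, it could be converted into a datetime object for sorting or filtering. This is a sketch that assumes every date follows the exact format shown above and that the program runs under an English (or C) locale, since %b and %p are locale-dependent.

```python
from datetime import datetime

# The date string as extracted from the <div>
raw = "Jan 7, 2013, 5:41 AM"

# %b = abbreviated month, %I = 12-hour clock, %p = AM/PM marker
parsed = datetime.strptime(raw, "%b %d, %Y, %I:%M %p")
print(parsed.isoformat())  # 2013-01-07T05:41:00
```

If a string in the file deviates from this format, strptime raises ValueError, so wrapping the call in try/except may be worthwhile for real data.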