Beautiful Soup data extract

Time:11-21

I have a local .html file from which I am extracting data. I parsed it with BeautifulSoup, but I don't know how to extract the date that is inside a div. The parsed markup looks like this:

<div ><div><div><a href="https://www.instagram.com/chuckbasspics" target="_blank">chuckbasspics</a></div><div>Jan 7, 2013, 5:41 AM</div></div></div><div ></div></div><div ><div ><div><div>

Any idea how to do it?

I already extracted the users and urls (href) with the following code:

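from bs4 import BeautifulSoup

# Read the local export and parse it
with open('followers.html', 'r') as fl_html:
    index = fl_html.read()
soup = BeautifulSoup(index, 'lxml')

# Every <a> tag that carries an href attribute
usernames = soup.find_all('a', href=True)

users = []
url_follower = []
for i in usernames:
    users.append(i.get_text(strip=True))
    url_follower.append(i['href'])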

CodePudding user response:

You can use either the bs4 API or a CSS selector:

from bs4 import BeautifulSoup

html_doc = """<div ><div><div><a href="https://www.instagram.com/chuckbasspics" target="_blank">chuckbasspics</a></div><div>Jan 7, 2013, 5:41 AM</div></div></div><div ></div></div><div ><div ><div><div>"""

soup = BeautifulSoup(html_doc, "html.parser")

Extracting the date using .get_text() with separator=

You can get all the text from the HTML snippet with a custom separator, then .split it on that separator:

t = soup.get_text(strip=True, separator="|").split("|")
print(t[1])

Prints:

Jan 7, 2013, 5:41 AM
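
Since the real file presumably holds many followers, the same split can be paired up per entry. This is only a sketch: the two-entry sample_html below is made up for illustration (example_user and its date are hypothetical), and it assumes every entry contains exactly one username followed by one date and that neither contains the "|" character.

```python
from bs4 import BeautifulSoup

# Hypothetical two-entry sample modelled on the snippet in the question.
sample_html = """
<div><div><div><a href="https://www.instagram.com/chuckbasspics" target="_blank">chuckbasspics</a></div><div>Jan 7, 2013, 5:41 AM</div></div></div>
<div><div><div><a href="https://www.instagram.com/example_user" target="_blank">example_user</a></div><div>Feb 3, 2014, 9:15 PM</div></div></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# strip=True drops whitespace-only strings, so only usernames and dates remain.
parts = soup.get_text(strip=True, separator="|").split("|")

# Even positions hold usernames, odd positions hold dates.
pairs = list(zip(parts[0::2], parts[1::2]))
print(pairs)
```

If an entry is missing its date, the even/odd pairing shifts for everything after it, so this approach is best kept for quick checks on uniform markup.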

CSS selector

Find the next sibling of the <div> which contains <a>:

t = soup.select_one("div:has(a) + div")
print(t.text)

Prints:

Jan 7, 2013, 5:41 AM
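
The same sibling relationship can also be followed from each link, which keeps every username, URL and date together in one pass. A minimal sketch, assuming each <a> sits alone in its own <div> and the date is in the very next sibling <div>, as in the snippet from the question; sample_html, example_user and the followers list are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical two-entry sample modelled on the snippet in the question.
sample_html = """
<div><div><div><a href="https://www.instagram.com/chuckbasspics" target="_blank">chuckbasspics</a></div><div>Jan 7, 2013, 5:41 AM</div></div></div>
<div><div><div><a href="https://www.instagram.com/example_user" target="_blank">example_user</a></div><div>Feb 3, 2014, 9:15 PM</div></div></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

followers = []
for link in soup.find_all("a", href=True):
    # link.parent is the <div> wrapping the <a>; the date is in the next sibling <div>.
    date_div = link.parent.find_next_sibling("div")
    followers.append({
        "user": link.get_text(strip=True),
        "url": link["href"],
        # Fall back to None when an entry has no date <div>.
        "date": date_div.get_text(strip=True) if date_div else None,
    })

for follower in followers:
    print(follower["user"], follower["url"], follower["date"])
```

Because the date is looked up relative to its own link, an entry without a date does not shift the dates of the entries after it.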

Using bs4 API

The time must contain AM or PM, so select the <div> whose string contains one of them (string= is the current name of the older text= argument):

t = soup.find("div", string=lambda t: t and (" AM" in t or " PM" in t))
print(t.text)

Prints:

Jan 7, 2013, 5:41 AM
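
If the extracted text needs to be sorted or compared, it can be turned into a datetime object. This is a sketch under two assumptions that the question does not confirm: every date in the file uses the same layout as the sample, and the month names and AM/PM markers are parsed under an English locale.

```python
from datetime import datetime

# Date string as extracted by any of the approaches above.
raw = "Jan 7, 2013, 5:41 AM"

# %b = abbreviated month name, %I = hour on a 12-hour clock, %p = AM/PM marker.
# Assumes an English locale and that every entry uses this layout.
parsed = datetime.strptime(raw, "%b %d, %Y, %I:%M %p")
print(parsed.isoformat())
```

strptime raises ValueError on a string that does not match the format, so wrapping the call in try/except may be worthwhile if the file mixes formats.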