I am trying to write a block of code that parses any lists on a page, ordered or unordered, and prints their items to the screen. I think the code below only gives me the unordered list data, and it is not working yet. I am not very familiar with HTML and am much more comfortable with Python. I am hoping someone can help me rework my code, understand it better, and get it working.
Here is my code:
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

def testLParser(url):
    content = urlopen(url).read().decode()
    parser = ListParser()
    parser.feed(content)
    return parser.getListItems()

class ListParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.f = 0
        self.counter = 0
        self.ul = []
        self.ol = []

    def handle_starttag(self, tag, attrs):
        if tag == 'ol':
            self.counter = 1
        if tag == 'ul':
            self.counter = 2
        if tag != 'li':
            return
        if self.f:
            self.f += 1
            return
        self.f = 1

    def handle_endtag(self, tag):
        if tag == 'ol' or tag == 'ul':
            self.counter = 0
        if tag == 'li' and self.f:
            self.f -= 1

    def handle_data(self, data):
        if self.f:
            if self.counter == 2:
                self.ul.append(data)
            elif self.counter == 1:
                self.ol.append(data)

    def getListItems(self):
        new = self.ul + self.ol
        print(new) #this is what is causing me the only error that shows up right now

print(testLParser('http://zoko.cdm.depaul.edu/csc242/lists.html'))
Thanks!
CodePudding user response:
I would recommend Selenium or Beautiful Soup. In this demonstration I will use Selenium. Printing all the text on the page is pretty basic. Make sure you run "pip install selenium" first. Older Selenium releases also need a ChromeDriver that matches your version of Chrome, placed in your working directory (or on your PATH) before you run the script; as far as I know, Selenium 4.6 and later can fetch the driver for you:
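from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://zoko.cdm.depaul.edu/csc242/lists.html")
# find_element_by_xpath was removed in Selenium 4; use find_element with By.XPATH.
elements = driver.find_element(By.XPATH, "/html/body").text
print(elements)
driver.quit()  # close the browser when done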
Once you have captured all the text on the page, you can organize it however you like. If you want the text as a Python list with one entry per line, I would recommend the following:
lines = elements.split("\n")  # named "lines" so the built-in list is not shadowed
print(lines)
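Splitting on newlines usually leaves blank or whitespace-only entries behind. Here is a minimal sketch of cleaning those up, assuming a made-up string (page_text) in place of the text Selenium returns; the real page content will differ. Keep in mind that plain page text no longer tells you which items came from an ordered list and which from an unordered one.

```python
# Hypothetical stand-in for the text Selenium returns for the page body.
page_text = "Fruits\napple\n\npear\n  \nSteps\nwash\ncut"

# Split into lines, trim surrounding whitespace, and drop blank lines.
lines = [line.strip() for line in page_text.splitlines() if line.strip()]
print(lines)
```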
Let me know if you have any questions!
CodePudding user response:
I think the result you want is an ordered list, just like in the HTML. I think the code should look like this:
from html.parser import HTMLParser
from urllib.request import urlopen

def testLParser(url):
    content = urlopen(url).read().decode()
    parser = ListParser()
    parser.feed(content)
    return parser.getListItems()

class ListParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.f = 0        # number of <li> tags currently open
        self.counter = 0  # 0 = outside any list, 1 = inside <ol>, 2 = inside <ul>
        self.ul = []
        self.ol = []

    def handle_starttag(self, tag, attrs):
        if tag == 'ol':
            self.counter = 1
        if tag == 'ul':
            self.counter = 2
        if tag != 'li':
            return
        self.f += 1

    def handle_endtag(self, tag):
        if tag == 'ol' or tag == 'ul':
            self.counter = 0
        if tag == 'li' and self.f:
            self.f -= 1

    def handle_data(self, data):
        if self.f:
            if self.counter == 1:
                self.ol.append(data)
            if self.counter == 2:
                self.ul.append(data)

    def getListItems(self):
        # Return the combined list (ordered-list items first) instead of printing it;
        # otherwise print(testLParser(...)) prints None.
        return self.ol + self.ul

print(testLParser('http://zoko.cdm.depaul.edu/csc242/lists.html'))
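One limitation of the code above is that the single counter field is reset to 0 whenever any list closes, so items that follow a nested list can be lost. A possible variant, sketched below, keeps a stack of open list tags instead. The class name NestedListParser and the sample markup are my own inventions for illustration, not taken from the page in the question, and the sketch assumes every <li> has a matching closing tag.

```python
from html.parser import HTMLParser


class NestedListParser(HTMLParser):
    """Collect the text of <li> items, grouped by the kind of list they sit in."""

    def __init__(self):
        super().__init__()
        self.open_lists = []   # stack of currently open 'ol'/'ul' tags
        self.in_item = 0       # number of <li> tags currently open
        self.items = {'ol': [], 'ul': []}

    def handle_starttag(self, tag, attrs):
        if tag in ('ol', 'ul'):
            self.open_lists.append(tag)
        elif tag == 'li':
            self.in_item += 1

    def handle_endtag(self, tag):
        if tag in ('ol', 'ul') and self.open_lists:
            self.open_lists.pop()
        elif tag == 'li' and self.in_item:
            self.in_item -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-blank text found inside an <li> of an open list,
        # filed under the innermost open list type.
        if text and self.in_item and self.open_lists:
            self.items[self.open_lists[-1]].append(text)


# Hypothetical sample markup, used here instead of fetching the real page.
sample = """
<ol><li>first</li><li>second
  <ul><li>nested a</li><li>nested b</li></ul>
</li></ol>
<ul><li>apple</li><li>pear</li></ul>
"""

parser = NestedListParser()
parser.feed(sample)
parser.close()
print(parser.items['ol'])
print(parser.items['ul'])
```

To run it against the real page, you could feed it urlopen(url).read().decode() in place of sample. The whitespace stripping in handle_data also keeps the newlines between tags out of the results.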