Finding text in HTML with Selenium and bs4

Time:08-26

I'm trying to build a web-scraping bot that retrieves my grades every hour. I have already coded the part that logs in to the website, but I can't figure out how to extract just the grade with bs4; instead I end up getting most of the page.

# Importing all modules
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup

# Opening myHIES through webdriver
driver=webdriver.Chrome("chromedriver.exe")
driver.get("https://hies.myschoolapp.com/app#login")
time.sleep(2.5)

# Logging in to myHIES, then navigating to the algebra grade page
driver.find_element(By.ID, "Username").send_keys("myemail")
driver.find_element(By.ID, "nextBtn").click()
time.sleep(4)
driver.find_element(By.ID, "i0118").send_keys("mypassword")
driver.find_element(By.ID, "idSIButton9").click()
time.sleep(2)
driver.find_element(By.ID, "idSIButton9").click()
print("*Breaths Lightly* WERE IN BABY!")
time.sleep(3.0)
driver.find_element(By.CSS_SELECTOR, "div#showHideGrade > div > label > span").click()
time.sleep(1.3)
driver.find_element(By.XPATH, '//*[@id="coursesContainer"]/div[1]/div[4]/a[1]').click()
print("handing off to bs4")
# Handing off manipulated page to bs4
page_source = driver.page_source

soup = BeautifulSoup(page_source, 'lxml')
print("handed off to bs4")
for tag in soup.find_all():
    print(tag.text)
print("should have printed tag text")

And this is the HTML I am trying to extract from:

<div>
    <div>
        <h1>
            69.00<span>%</span>
        </h1>
        <h6>marking period</h6>
    </div>
    <div>
        <h1>69.00<span>%</span></h1>
        <h6>year</h6>
    </div>
</div>

CodePudding user response:

You need to specify which tag you want to find; called with no arguments, find_all returns every tag in the document. Since the text you are looking for is in an h1 tag, pass "h1" to find_all.

for tag in soup.find_all("h1"):
    print(tag.text)

To read more about find_all, see the Beautiful Soup documentation.
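As a rough, self-contained sketch of the same idea, the snippet below runs find_all("h1") against the HTML posted in the question instead of a live page. The html string stands in for driver.page_source, the built-in "html.parser" is used only so that lxml is not required, and pairing each grade with the h6 that follows it is an assumption based on the posted markup.

```python
from bs4 import BeautifulSoup

# The snippet from the question, standing in for driver.page_source.
html = """
<div>
    <div>
        <h1>
            69.00<span>%</span>
        </h1>
        <h6>marking period</h6>
    </div>
    <div>
        <h1>69.00<span>%</span></h1>
        <h6>year</h6>
    </div>
</div>
"""

# "html.parser" ships with Python; the question's "lxml" should behave the same here.
soup = BeautifulSoup(html, "html.parser")

grades = {}
for h1 in soup.find_all("h1"):
    # get_text(strip=True) trims the whitespace around each text fragment
    grade = h1.get_text(strip=True)
    # Assumption: the label is the <h6> that follows the <h1> in the same <div>
    label = h1.find_next_sibling("h6").get_text(strip=True)
    grades[label] = grade

print(grades)  # {'marking period': '69.00%', 'year': '69.00%'}
```

Using get_text(strip=True) rather than tag.text avoids the leading and trailing whitespace that surrounds the first grade in the page source.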

CodePudding user response:

If the provided HTML section is part of your soup, you could try this:

....
# Locate the container div (assumes it carries the class "col-md-2")
main_div = soup.find('div', {'class': 'col-md-2'})
data_tags = main_div.find_all('h1')   # grade values
data_notes = main_div.find_all('h6')  # labels such as "marking period"

out_dct = {}
for i in range(2):
    # Remove spaces and tabs, then split on newlines
    grades = data_tags[i].text.replace(' ', '').replace('\t', '').split('\n')
    notes = data_notes[i].text.replace('\t', '').split('\n')
    out_dct['grade_' + str(i)] = grades
    out_dct['grade_' + str(i)].append(notes[0])
    
print(out_dct)

'''    Result:
{'grade_0': ['69.00%', 'marking period'], 'grade_1': ['69.00%', 'year']}
'''
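
If the bot needs to compare grades from one hour to the next, the strings in the result above can be turned into numbers. This is only a sketch: parse_grade is a hypothetical helper name, and out_dct is copied by hand from the result shown above rather than scraped.

```python
def parse_grade(text: str) -> float:
    """Convert a grade string such as '69.00%' to a float (hypothetical helper)."""
    return float(text.strip().rstrip("%"))


# Sample data copied from the result above, not scraped live
out_dct = {'grade_0': ['69.00%', 'marking period'], 'grade_1': ['69.00%', 'year']}

# Each value is a [grade, label] pair; key the numbers by label
numeric = {label: parse_grade(grade) for grade, label in out_dct.values()}
print(numeric)  # {'marking period': 69.0, 'year': 69.0}
```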