How to exclude Button text while scraping

Time:08-04

I'm getting the names of every university in the USA from https://www.4icu.org/us/a-z/

I've successfully gotten all names with the following code:

import requests
from bs4 import BeautifulSoup

link = "https://www.4icu.org/us/a-z/"

html = requests.get(link).text

soup = BeautifulSoup(html, 'lxml')
for strainedsoup in soup.find_all('tr'):
   superstrainedsoup = strainedsoup.a.text
   print(superstrainedsoup)

However, the page has a button at the bottom, and because I'm reading the text of each row's a tag, the button's text gets scraped too. The output ends like this:

...

Young Harris College

Youngstown State University

Add University

I'm not sure how to get rid of "Add University" other than removing it by hand after writing the output to a text file, or using "limit=". Neither approach works if I want to reuse this code on other websites where buttons are not only at the end of the page.

To add to this, the button text is picked up in the first place because the button sits inside the a tag, like so:

<a href=...>
  <button type="button"> Add University</button>
</a>

CodePudding user response:

You have two options:

Drop the last row from the list by slicing (this only helps when the button is in the final row):

for strainedsoup in soup.find_all('tr')[:-1]:

Or use a CSS selector that skips the row containing the button, which has the class "small":

for strainedsoup in soup.select('tr:not(.small)'):
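Since the question also asks about pages where buttons are not only at the end, a third possibility is to skip any link that wraps a button. This is only a minimal sketch: the SAMPLE_HTML markup and the function name link_texts_without_buttons are made up for illustration, and it uses the built-in "html.parser" instead of lxml so it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption, not the site's actual HTML).
SAMPLE_HTML = """
<table>
  <tr><td><a href="/a">Young Harris College</a></td></tr>
  <tr><td><a href="/add"><button type="button"> Add University</button></a></td></tr>
  <tr><td><a href="/b">Youngstown State University</a></td></tr>
</table>
"""


def link_texts_without_buttons(html):
    """Return the text of the first link in each row, skipping links that wrap a button."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for row in soup.find_all("tr"):
        anchor = row.a  # first <a> inside the row, or None if the row has no link
        if anchor is None or anchor.find("button") is not None:
            continue  # no link at all, or the link only wraps a button
        names.append(anchor.get_text(strip=True))
    return names


names = link_texts_without_buttons(SAMPLE_HTML)
print(names)
```

Because the check looks at the link itself rather than at its position or the row's class, it should not matter where on the page the button row appears.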
