How to get rid of html tag line in my data list

Time:12-30

I am writing code to collect all the professors' emails from my university as web scraping practice. Once what I currently have works, I will pass the names through to get their individual pages and then their emails (not worried about that right now). My question is how I can stop the list of retrieved names from including the HTML markup, such as <h4 >Nivea Canalli Bona</h4>, when all I want is "Nivea Canalli Bona".

Is there any way to do this that also makes my life easier when I run a for loop later on to get their individual pages?

import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://www.bu.edu/com/profiles/faculty/page/1/'
data = requests.get(url)

my_data = []

html = BeautifulSoup(data.text, 'html.parser')

for professor in html:

    name = html.select('h4.profile-card__name')

    my_data.append({"name": name})

pprint(my_data)

CodePudding user response:

To get the text from within a tag, call .text (or .get_text()).
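As a small illustration of the difference between the two, here is a sketch on an inline snippet rather than the live page. The snippet string and the variable names are made up for the example; only the h4.profile-card__name selector comes from the question.

```python
from bs4 import BeautifulSoup

# Inline snippet standing in for one name heading (hypothetical markup).
snippet = '<h4 class="profile-card__name"> Nivea Canalli Bona </h4>'
tag = BeautifulSoup(snippet, 'html.parser').select_one('h4.profile-card__name')

raw_name = tag.text                    # text exactly as it appears inside the tag
clean_name = tag.get_text(strip=True)  # same text with surrounding whitespace removed

print(repr(raw_name))
print(repr(clean_name))
```

get_text(strip=True) is handy when the page pads the name with spaces or newlines.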

Also, you shouldn't do this:

for professor in html:
    name = html.select('h4.profile-card__name')

Each pass of that loop ignores the loop variable and selects every matching tag from the whole document again, so the same full list gets appended once per iteration. If you print the data you'll see this in action:

[<h4 >David Abel</h4>, <h4 >Maria Afzal</h4>, <h4 >Shweta Agarwal</h4>, <h4 >Michelle Amazeen</h4>, <h4 >Christopher Anderson</h4>, <h4 >Judith Austin</h4>, <h4 >John Baynard</h4>, <h4 >Larry Bean</h4>, <h4 >Christopher Beaudoin</h4>, <h4 >Lisa Liberty Becker</h4>, <h4 >Brooks Beisch</h4>, <h4 >Jerry Berger</h4>, <h4 >Tobe Berkovitz</h4>, <h4 >A. Sherrod Blakely</h4>, <h4 >Carter Blanchard</h4>, <h4 >Lisa Borden</h4>, <h4 >Adam Boyajy</h4>, <h4 >Bill Braudis</h4>, <h4 >Barry Brodsky</h4>, <h4 >Tatyana Bronstein</h4>, <h4 >Kathryn Burak</h4>, <h4 >Asad Butt</h4>, <h4 >Nivea Canalli Bona</h4>, <h4 >Susan Carlton</h4>]
[<h4 >David Abel</h4>, <h4 >Maria Afzal</h4>, <h4 >Shweta Agarwal</h4>, <h4 >Michelle Amazeen</h4>, <h4 >Christopher Anderson</h4>, <h4 >Judith Austin</h4>, <h4 >John Baynard</h4>, <h4 >Larry Bean</h4>, <h4 >Christopher Beaudoin</h4>, <h4 >Lisa Liberty Becker</h4>, <h4 >Brooks Beisch</h4>, <h4 >Jerry Berger</h4>, <h4 >Tobe Berkovitz</h4>, <h4 >A. Sherrod Blakely</h4>, <h4 >Carter Blanchard</h4>, <h4 >Lisa Borden</h4>, <h4 >Adam Boyajy</h4>, <h4 >Bill Braudis</h4>, <h4 >Barry Brodsky</h4>, <h4 >Tatyana Bronstein</h4>, <h4 >Kathryn Burak</h4>, <h4 >Asad Butt</h4>, <h4 >Nivea Canalli Bona</h4>, <h4 >Susan Carlton</h4>]
[<h4 >David Abel</h4>, <h4 >Maria Afzal</h4>, <h4 >Shweta Agarwal</h4>, <h4 >Michelle Amazeen</h4>, <h4 >Christopher Anderson</h4>, <h4 >Judith Austin</h4>, <h4 >John Baynard</h4>, <h4 >Larry Bean</h4>, <h4 >Christopher Beaudoin</h4>, <h4 >Lisa Liberty Becker</h4>, <h4 >Brooks Beisch</h4>, <h4 >Jerry Berger</h4>, <h4 >Tobe Berkovitz</h4>, <h4 >A. Sherrod Blakely</h4>, <h4 >Carter Blanchard</h4>, <h4 >Lisa Borden</h4>, <h4 >Adam Boyajy</h4>, <h4 >Bill Braudis</h4>, <h4 >Barry Brodsky</h4>, <h4 >Tatyana Bronstein</h4>, <h4 >Kathryn Burak</h4>, <h4 >Asad Butt</h4>, <h4 >Nivea Canalli Bona</h4>, <h4 >Susan Carlton</h4>]
[<h4 >David Abel</h4>, <h4 >Maria Afzal</h4>, <h4 >Shweta Agarwal</h4>, <h4 >Michelle Amazeen</h4>, <h4 >Christopher Anderson</h4>, <h4 >Judith Austin</h4>, <h4 >John Baynard</h4>, <h4 >Larry Bean</h4>, <h4 >Christopher Beaudoin</h4>, <h4 >Lisa Liberty Becker</h4>, <h4 >Brooks Beisch</h4>, <h4 >Jerry Berger</h4>, <h4 >Tobe Berkovitz</h4>, <h4 >A. Sherrod Blakely</h4>, <h4 >Carter Blanchard</h4>, <h4 >Lisa Borden</h4>, <h4 >Adam Boyajy</h4>, <h4 >Bill Braudis</h4>, <h4 >Barry Brodsky</h4>, <h4 >Tatyana Bronstein</h4>, <h4 >Kathryn Burak</h4>, <h4 >Asad Butt</h4>, <h4 >Nivea Canalli Bona</h4>, <h4 >Susan Carlton</h4>]

Instead, your code should look like this:

import requests
from bs4 import BeautifulSoup


url = 'https://www.bu.edu/com/profiles/faculty/page/1/'
data = requests.get(url)

my_data = []

html = BeautifulSoup(data.text, 'html.parser')

professors = html.select('h4.profile-card__name')

for professor in professors:
    my_data.append(professor.text)

print(my_data)

Prints:

['David Abel', 'Maria Afzal', 'Shweta Agarwal', 'Michelle Amazeen', 'Christopher Anderson', 'Judith Austin', 'John Baynard', 'Larry Bean', 'Christopher Beaudoin', 'Lisa Liberty Becker', 'Brooks Beisch', 'Jerry Berger', 'Tobe Berkovitz', 'A. Sherrod Blakely', 'Carter Blanchard', 'Lisa Borden', 'Adam Boyajy', 'Bill Braudis', 'Barry Brodsky', 'Tatyana Bronstein', 'Kathryn Burak', 'Asad Butt', 'Nivea Canalli Bona', 'Susan Carlton']
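For the later loop over individual pages, it may help to store each name together with its profile link. The sketch below is only a guess at the page structure: it assumes each name heading sits inside an <a> tag pointing at the profile, and the sample_html string and its href values are invented for the example. Check the real markup in your browser's inspector and adjust.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'https://www.bu.edu/com/profiles/faculty/page/1/'

# Hypothetical markup: the real page may nest the link differently.
sample_html = '''
<a href="/com/profile/david-abel/"><h4 class="profile-card__name">David Abel</h4></a>
<a href="/com/profile/maria-afzal/"><h4 class="profile-card__name">Maria Afzal</h4></a>
'''

soup = BeautifulSoup(sample_html, 'html.parser')

profiles = []
for heading in soup.select('h4.profile-card__name'):
    link = heading.find_parent('a')  # assumed: the heading is wrapped by its profile link
    profiles.append({
        'name': heading.get_text(strip=True),
        # Resolve relative hrefs against the listing page; None if no link was found.
        'url': urljoin(BASE_URL, link['href']) if link else None,
    })

print(profiles)
```

With a list of dicts like this, the follow-up loop can call requests.get(profile['url']) for each entry without having to rebuild the URL from the name.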

CodePudding user response:

You could use a regular expression (regex) to match a specific pattern:

import re
name_html = '<h4 >Nivea Canalli Bona</h4>'
print(re.match(r"<.+>([A-z ]+)<\/.+>", name_html)[1])

Output:

Nivea Canalli Bona

Explanation:

  • The <.+> part of the regular expression matches an opening tag; specifically a < character followed by one or more of any character (the .+) and a > character.
  • Then the next part is the name, which we want to capture, so it is wrapped in round brackets. The name is matched by one or more alphabetic characters ([A-z]) or spaces. Note that [A-z] also matches the few punctuation characters that sit between Z and a in ASCII; [A-Za-z] is stricter.
  • Finally, the closing tag is matched, which is identified by a < character followed by a / character (escaped with a backslash, although Python does not require it), then one or more of any character and a > character.
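One caveat: a character class limited to letters and spaces would not match a name such as "A. Sherrod Blakely" from the list above, because of the dot. A variant that captures everything up to the next < avoids that. This is a sketch only; the tags_html string and the NAME_PATTERN name are made up for the example, and an HTML parser is generally the more robust tool for this job.

```python
import re

# Sample input in the same shape as the tags shown in the question.
tags_html = '<h4 >David Abel</h4>, <h4 >A. Sherrod Blakely</h4>, <h4 >Nivea Canalli Bona</h4>'

# <h4[^>]*> : an opening h4 tag with any attributes
# ([^<]+)   : capture everything up to the next '<' (covers dots, hyphens, accents)
# </h4>     : the closing tag
NAME_PATTERN = re.compile(r'<h4[^>]*>([^<]+)</h4>')

# findall returns the captured group for every match in the string.
names = [name.strip() for name in NAME_PATTERN.findall(tags_html)]
print(names)
```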

Let me know if this works. Hope this helps.
