How to scrape list of titles from a webpage?

Time:08-28

I am trying to scrape the list of courses available on the Udacity website.

[Screenshot not available]

CodePudding user response:

Building the soup from the initial HTML response alone doesn't work because the site doesn't load everything at once: it adds further courses to the page dynamically, possibly to reduce the initial page load time. To solve this, you can use something like Selenium for Python, which drives a real browser so that the page's JavaScript runs.

Then, we'll use CSS Selectors to select h2 elements with a class attribute containing "card_title" (I viewed the source on that site and it looks like that's how courses are displayed).
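To see what that selector matches, here is a minimal offline sketch. The HTML snippet is made up for illustration (the hashed class suffixes are copied from the answers on this page and may well change when the site is redeployed); only the select call mirrors the approach described above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating two course cards and one unrelated heading.
SAMPLE_HTML = """
<a class="card_container__25DrK" href="/course/sql">
  <h2 class="card_title__35G97">SQL</h2>
</a>
<a class="card_container__25DrK" href="/course/react">
  <h2 class="card_title__35G97 extra">React</h2>
</a>
<h2 class="page_heading">All courses</h2>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# [class*="card_title"] matches when the class attribute contains the
# substring anywhere, so the generated suffix does not need to be known.
titles = [h2.get_text(strip=True) for h2 in soup.select('h2[class*="card_title"]')]
print(titles)
```

Matching on the stable prefix rather than the full generated class name is the reason for using the substring form here.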

You'll need to download a driver for Selenium. I'm using Chrome on Windows here, so I downloaded chromedriver.exe from the list of available drivers (ChromeDriver 104.0.5112.79) for the latest stable release.

Example code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument('--headless')

# I'm using Chrome in this example; you can search online for more on
# how Selenium works. This path points to where I downloaded the driver.
# Selenium 4 takes the driver path through a Service object
# (the executable_path argument is deprecated).
service = Service(r'C:\Users\User\Downloads\chromedriver_win32\chromedriver.exe')
browser = webdriver.Chrome(options=options, service=service)
browser.get("https://www.udacity.com/courses/all")

html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')

# match h2 elements with a class containing "card_title"
for course in soup.select('h2[class*="card_title"]'):
    course_name = course.get_text()
    # do something with course_name, e.g. add it to a list
    print(course_name)

browser.quit()

Output:

Data Engineer
Business Analytics
Product Manager
Programming for Data Science with Python
Introduction to Programming
Data Scientist
Data Analyst
C++
React
Blockchain Developer
Self-Driving Car Engineer
Machine Learning DevOps Engineer
Deep Learning
SQL
Front End Web Developer
Full Stack Web Developer
Java Programming
Digital Marketing
Artificial Intelligence for Trading
Data Structures and Algorithms
UX Designer
Java Developer
AWS Machine Learning Engineer
Intermediate Python
AI Programming with Python
Growth Product Manager
Intro to Self-Driving Cars
Cloud DevOps Engineer
Robotics Software Engineer
Deep Reinforcement Learning
Data Architect
Android Kotlin Developer
Computer Vision
Data Analysis and Visualization with Microsoft Power BI
Natural Language Processing
Cloud Developer
Zero Trust Security
Data Streaming
AI Product Manager
Introduction to Cybersecurity
iOS Developer
Data Engineering with Microsoft Azure
Intro to Machine Learning with TensorFlow
AWS Cloud Architect
Full Stack JavaScript Developer
Digital Project Management
Cloud Native Application Architecture
Intro to Machine Learning with PyTorch
Data Product Manager
Flying Car and Autonomous Flight Engineer
Sensor Fusion Engineer
Ethical Hacker
Predictive Analytics For Business
Intermediate JavaScript
Android Basics
Artificial Intelligence
Agile Software Development
Marketing Analytics
Data Visualization
Cloud DevOps using Microsoft Azure
Digital Freelancer
AI for Healthcare
Hybrid Cloud Engineer
Data Science for Business Leaders
AI for Business Leaders
Privacy Engineer
Site Reliability Engineer
Security Engineer
Cloud Developer using Microsoft Azure
Cloud Architect using Microsoft Azure
Machine Learning Engineer for Microsoft Azure
Security Architect
AI Engineer using Microsoft Azure
Data Privacy
Security Analyst
Enterprise Security
Intel® Edge AI for IoT Developers
Cloud Computing for Business Leaders
Programming for Data Science with R
RPA Developer with UiPath
Cybersecurity for Business Leaders
Intro to Information Security
Cyber-Physical Systems Security
Network Security
Getting Started with Google Workspace
Rapid Prototyping
Creating an Analytical Dataset
Problem Solving with Advanced Analytics
Classification Models
Product Design
Segmentation and Clustering
Time Series Forecasting
App Marketing
App Monetization
A/B Testing for Business Analysts
How to Build a Startup
Get Your Startup Started
Managing Remote Teams with Upwork
Google Cloud Digital Leader Training
Cloud Native Fundamentals
Hybrid Cloud Fundamentals
Intro to Data Analysis
SQL for Data Analysis
Database Systems Concepts & Design
Intro to Inferential Statistics
Spark
Data Analysis and Visualization
Cyber-Physical Systems Design & Analysis
Differential Equations in Action
Self-Driving Fundamentals: Featuring Apollo
AWS Machine Learning Foundations Course
Introduction to Machine Learning using Microsoft Azure
AI Fundamentals
Linear Algebra Refresher Course
Machine Learning: Unsupervised Learning
Big Data Analytics in Healthcare
Intel® Edge AI Fundamentals with OpenVINO™
Artificial Intelligence
Secure and Private AI
Model Building and Validation
Data Visualization and D3.js
Machine Learning for Trading
Machine Learning
Intro to Hadoop and MapReduce
Real-Time Analytics with Apache Storm
A/B Testing
Data Analysis with R
Knowledge-Based AI: Cognitive Systems
Introduction to TensorFlow Lite
Introduction to Computer Vision
Intro to TensorFlow for Deep Learning
Eigenvectors and Eigenvalues
Intro to Artificial Intelligence
Artificial Intelligence for Robotics
Intro to Deep Learning with PyTorch
AWS DeepRacer
Reinforcement Learning
Introduction to Machine Learning Course
Product Manager Interview Preparation
Microsoft Power Platform
Web Tooling & Automation
Front End Frameworks
Responsive Web Design Fundamentals
How to Install Android Studio
Android Basics: Multiscreen Apps
Website Performance Optimization
iOS Networking with Swift
JavaScript Design Patterns
Android Basics: User Input
Android Performance
Responsive Images
Xcode Debugging
Gradle for Android and Java
Build Native Mobile Apps with Flutter
JavaScript Promises
UIKit Fundamentals
Android Basics: User Interface
Client-Server Communication
What is Programming?
Building High Conversion Web Forms
Advanced Android App Development
Software Architecture & Design
Authentication & Authorization: OAuth
Intro to iOS App Development with Swift
Introduction to Operating Systems
Android Basics: Networking
Web Accessibility
Android Basics: Data Storage
Scalable Microservices with Kubernetes
Developing Android Apps with Kotlin
Browser Rendering Optimization
Learn Swift Programming Syntax
Offline Web Applications
Kotlin for Android Developers
UX Design for Mobile Developers
Software Development Process
Data Visualization in Tableau
Intro to Progressive Web Apps
Writing READMEs
Software Analysis & Testing
iOS Persistence and Core Data
Computer Networking
Firebase Analytics: iOS
Human-Computer Interaction
2D Game Development with libGDX
Intro to jQuery
How to create <anything> in Android
Introduction to Graduate Algorithms
Dynamic Web Applications with Sinatra
How to Make a Platformer Using libGDX
JavaScript Testing
Object-Oriented JavaScript
Localization Essentials
Compilers: Theory and Practice
HTML5 Canvas
Object Oriented Programming in Java
Designing RESTful APIs
GT - Refresher - Advanced OS
Intro to JavaScript
Grand Central Dispatch (GCD)
Continuous Integration and Deployment
Swift for Beginners
Intro to Statistics
Intro to HTML and CSS
Developing Android Apps
Introduction to Python Programming
Introduction to Virtual Reality
Objective-C for Swift Developers
Interactive 3D Graphics
Full Stack Foundations
High Performance Computer Architecture
AutoLayout
Kotlin Bootcamp for Programmers
Shell Workshop
Core ML: Machine Learning for iOS
Statistics
Intro to Theoretical Computer Science
Design of Computer Programs
Data Wrangling with MongoDB
Swift for Developers
Firebase in a Weekend: Android
Software Debugging
Deploying a Hadoop Cluster
Server-Side Swift
Networking for Web Developers
Intro to Physics
Intro to Relational Databases
ES6 - JavaScript Improved
Mobile Design and Usability for iOS
Intro to AJAX
Intro to Algorithms
The MVC Pattern in Ruby
WeChat Mini Program Development
Asynchronous JavaScript Requests
Embedded Systems
High Performance Computing
HTTP & Web Servers
Advanced Android with Kotlin
Computability, Complexity & Algorithms
Advanced Operating Systems
Passwordless Login Solutions for iOS
Version Control with Git
Firebase in a Weekend: iOS
Intro to Point & Click App Development
Deploying Applications with Heroku
Applied Cryptography
Java Programming Basics
C++ For Programmers
Intro to Backend
JavaScript and the DOM
Firebase Analytics: Android
Configuring Linux Web Servers
How to Make an iOS App
Intro to DevOps
Google Maps APIs
Passwordless Login Solutions for Android
Mobile Design and Usability for Android
iOS Design Patterns
Intro to Psychology
Engagement & Monetization | Mobile Games
Material Design for Android Developers
Craft Your Cover Letter
Refresh Your Resume
Strengthen Your LinkedIn Network & Brand
Data Science Interview Prep
Android Interview Prep
Machine Learning Interview Preparation
Front-End Interview Prep
Full-Stack Interview Prep
Data Structures & Algorithms in Swift
iOS Interview Prep
VR Interview Prep

CodePudding user response:

The page's content is loaded at runtime by JavaScript. bs4 can't render or parse such dynamic content, so you can let Selenium render the page and then parse the result with bs4 as follows:

Example:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time

webdriver_service = Service("./chromedriver")  # your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
url = 'https://www.udacity.com/courses/all'
driver.get(url)
driver.maximize_window()
time.sleep(3)  # give the page's JavaScript time to add the course cards

soup = BeautifulSoup(driver.page_source, 'lxml')

# Collect the title from every course card
lst = []
for card in soup.find_all("a", class_="card_container__25DrK"):
    title = card.select_one('h2.card_title__35G97').text
    lst.append(title)
print(lst)

driver.quit()  # close the browser once the HTML has been parsed

Output:

['Data Engineer', 'Business Analytics', 'Product Manager', 'Programming for Data Science with Python', 'Introduction to Programming', 'Data Scientist', 'Data Analyst', 'C++', 'React', 'Blockchain Developer', 'Self-Driving Car Engineer', 'Machine Learning DevOps Engineer', 'Deep Learning', 'SQL', 'Front End Web Developer', 'Full Stack Web Developer', 'Java Programming', 'Digital Marketing', 'Artificial Intelligence for Trading', 'Data Structures and Algorithms', 'UX Designer', 'Java Developer', 'AWS Machine Learning Engineer', 'Intermediate Python', 'AI Programming with Python', 'Growth Product Manager', 'Intro to Self-Driving Cars', 'Cloud DevOps Engineer', 'Robotics Software Engineer', 
'Deep Reinforcement Learning', 'Data Architect', 'Android Kotlin Developer', 'Computer Vision', 'Data Analysis and Visualization with Microsoft Power BI', 'Natural Language Processing', 'Cloud Developer', 'Zero Trust Security', 'Data Streaming', 'AI Product Manager', 'Introduction to Cybersecurity', 'iOS Developer', 'Data Engineering with Microsoft Azure', 'Intro to Machine Learning with TensorFlow', 'AWS Cloud Architect', 'Full Stack JavaScript Developer', 'Digital Project Management', 'Cloud Native Application Architecture', 'Intro to Machine Learning with PyTorch', 'Data Product Manager', 'Flying Car and Autonomous Flight Engineer', 'Sensor Fusion Engineer', 'Ethical Hacker', 'Predictive Analytics For Business', 'Intermediate JavaScript', 'Android Basics', 'Artificial Intelligence', 'Agile Software Development', 'Marketing Analytics', 'Data Visualization', 'Cloud DevOps using Microsoft Azure', 'Digital Freelancer', 'AI for Healthcare', 'Hybrid Cloud Engineer', 'Data Science for Business Leaders', 'AI for Business Leaders', 'Privacy Engineer', 'Site Reliability Engineer', 'Security Engineer', 'Cloud Developer using Microsoft Azure', 'Cloud 
Architect using Microsoft Azure', 'Machine Learning Engineer for Microsoft Azure', 'Security Architect', 'AI Engineer using Microsoft Azure', 'Data Privacy', 'Security Analyst', 'Enterprise Security', 'Intel® Edge AI for IoT Developers', 'Cloud Computing for Business Leaders', 'Programming for Data Science with R', 'RPA Developer with UiPath', 'Cybersecurity for Business Leaders', 'Intro to Information Security', 'Cyber-Physical Systems Security', 'Network Security', 'Getting Started with Google Workspace', 'Rapid Prototyping', 'Creating an Analytical Dataset', 'Problem Solving with Advanced Analytics', 'Classification Models', 'Product Design', 'Segmentation and Clustering', 'Time Series Forecasting', 'App Marketing', 'App Monetization', 'A/B Testing for Business Analysts', 'How to Build a Startup', 
'Get Your Startup Started', 'Managing Remote Teams with Upwork', 'Google Cloud Digital Leader Training', 'Cloud Native Fundamentals', 'Hybrid Cloud Fundamentals', 'Intro to Data Analysis', 'SQL for Data Analysis', 'Database Systems 
Concepts & Design', 'Intro to Inferential Statistics', 'Spark', 'Data Analysis and Visualization', 'Cyber-Physical Systems Design & Analysis', 'Differential Equations in Action', 'Self-Driving Fundamentals: Featuring Apollo ', 'AWS 
Machine Learning Foundations Course', 'Introduction to Machine Learning using Microsoft Azure', 'AI Fundamentals', 'Linear Algebra Refresher Course', 'Machine Learning: Unsupervised Learning', 'Big Data Analytics in Healthcare', 'Intel® Edge AI Fundamentals with OpenVINO™', 'Artificial Intelligence', 'Secure and Private AI', 'Model Building and Validation', 'Data Visualization and D3.js', 'Machine Learning for Trading', 'Machine Learning', 'Intro to Hadoop and MapReduce', 'Real-Time Analytics with Apache Storm', 'A/B Testing', 'Data Analysis with R', 'Knowledge-Based AI: Cognitive Systems', 'Introduction to TensorFlow Lite', 'Introduction to Computer Vision', 'Intro to TensorFlow for Deep Learning', 'Eigenvectors and Eigenvalues', 'Intro to Artificial Intelligence', 'Artificial Intelligence for Robotics', 'Intro to Deep Learning with PyTorch', 'AWS DeepRacer', 'Reinforcement Learning', 'Introduction to Machine Learning Course', 'Product Manager Interview Preparation', 'Microsoft Power Platform', 'Web Tooling & Automation', 'Front End Frameworks', 'Responsive Web Design Fundamentals', 'How to Install Android Studio', 'Android Basics: Multiscreen Apps', 'Website Performance Optimization', 'iOS Networking with Swift', 'JavaScript Design Patterns', 'Android Basics: User Input', 'Android Performance', 'Responsive Images', 'Xcode Debugging', 'Gradle for Android and Java', 'Build Native Mobile Apps with Flutter', 'JavaScript Promises', 'UIKit Fundamentals', 'Android Basics: User Interface', 'Client-Server Communication', 'What is Programming?', 'Building High Conversion Web Forms', 'Advanced Android App 
Development', 'Software Architecture & Design', 'Authentication & Authorization: OAuth', 'Intro to iOS App Development with Swift', 'Introduction to Operating Systems', 'Android Basics: Networking', 'Web Accessibility', 'Android Basics: Data Storage', 'Scalable Microservices with Kubernetes', 'Developing Android Apps with Kotlin', 'Browser Rendering Optimization', 'Learn Swift Programming Syntax', 'Offline Web Applications', 'Kotlin for Android Developers', 'UX Design for Mobile Developers', 'Software Development Process', 'Data Visualization in Tableau', 'Intro to Progressive Web Apps', 'Writing READMEs', 'Software Analysis & Testing', 'iOS Persistence and Core Data', 'Computer Networking', 'Firebase Analytics: iOS', 'Human-Computer Interaction', '2D Game Development with libGDX', 'Intro to jQuery', 
'How to create <anything> in Android', 'Introduction to Graduate Algorithms', 'Dynamic Web Applications with Sinatra', 'How to Make a Platformer Using libGDX', 'JavaScript Testing', 'Object-Oriented JavaScript', 'Localization Essentials', 'Compilers: Theory and Practice', 'HTML5 Canvas', 'Object Oriented Programming in Java', 'Designing RESTful APIs', 'GT - Refresher - Advanced OS', 'Intro to JavaScript', 'Grand Central Dispatch (GCD)', 'Continuous Integration and Deployment', 'Swift for Beginners', 'Intro to Statistics', 'Intro to HTML and CSS', 'Developing Android Apps', 
'Introduction to Python Programming', 'Introduction to Virtual Reality', 'Objective-C for Swift Developers', 'Interactive 3D Graphics', 'Full Stack Foundations', 'High Performance Computer Architecture', 'AutoLayout', 'Kotlin Bootcamp for Programmers', 'Shell Workshop', 'Core ML: Machine Learning for iOS', 'Statistics', 'Intro to Theoretical Computer Science', 'Design of Computer Programs', 'Data Wrangling with MongoDB', 'Swift for Developers', 'Firebase in a 
Weekend: Android', 'Software Debugging', 'Deploying a Hadoop Cluster', 'Server-Side Swift', 'Networking for Web Developers', 'Intro to Physics', 'Intro to Relational Databases', 'ES6 - JavaScript Improved', 'Mobile Design and Usability for iOS', 'Intro to AJAX', 'Intro to Algorithms', 'The MVC Pattern in Ruby', 'WeChat Mini Program Development', 
'Asynchronous JavaScript Requests', 'Embedded Systems', 'High Performance Computing', 'HTTP & Web Servers', 'Advanced Android with Kotlin', 'Computability, Complexity & Algorithms', 'Advanced Operating Systems', 'Passwordless Login 
Solutions for iOS', 'Version Control with Git', 'Firebase in a Weekend: iOS', 'Intro to Point & Click App Development', 'Deploying Applications with Heroku', 'Applied Cryptography', 'Java Programming Basics', 'C++ For Programmers', 
'Intro to Backend', 'JavaScript and the DOM', 'Firebase Analytics: Android', 'Configuring Linux Web Servers', 'How to Make an iOS App', 'Intro to DevOps', 'Google Maps APIs', 'Passwordless Login Solutions for Android', 'Mobile Design and Usability for Android', 'iOS Design Patterns', 'Intro to Psychology', 'Engagement & Monetization | Mobile Games', 'Material Design for Android Developers', 'Craft Your Cover Letter', 'Refresh Your Resume', 'Strengthen Your LinkedIn Network & Brand', 'Data Science Interview Prep', 'Android Interview Prep', 'Machine Learning Interview Preparation', 'Front-End Interview Prep', 'Full-Stack Interview Prep', 'Data Structures & Algorithms in Swift', 'iOS Interview Prep', 'VR Interview Prep']
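The fixed time.sleep(3) in the example above can be too short on a slow connection and wasteful on a fast one. A possible alternative, sketched here under assumptions, is an explicit wait that polls until the first title element appears. The function names, the 15-second timeout and the substring selector are illustrative choices, not part of the original answer, and the sketch assumes Selenium can locate a Chrome driver by itself.

```python
from bs4 import BeautifulSoup

URL = "https://www.udacity.com/courses/all"
TITLE_SELECTOR = 'h2[class*="card_title"]'


def extract_titles(html):
    """Return the text of every element matching TITLE_SELECTOR."""
    soup = BeautifulSoup(html, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.select(TITLE_SELECTOR)]


def scrape_titles(url=URL, timeout=15):
    """Render the page in Chrome, wait for the first title, then parse."""
    # Imported here so extract_titles() stays usable without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Assumption: Selenium finds a Chrome driver on its own; otherwise
    # pass service=Service(...) with your driver path as shown above.
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Poll until at least one title is in the DOM; raises
        # TimeoutException if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, TITLE_SELECTOR))
        )
        return extract_titles(driver.page_source)
    finally:
        driver.quit()
```

Calling scrape_titles() would open a browser and return the list of titles. Note that the wait only guarantees the first title is present; if the site adds cards in batches, a further wait or scroll may still be needed.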

CodePudding user response:

The main issue here is that BeautifulSoup by itself only performs static scraping, i.e. it sees just the static HTML it is given and does not run any JavaScript. You will need to use something like Selenium with BeautifulSoup to scrape dynamically generated HTML.
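The effect can be reproduced offline with a small sketch. The page below is hypothetical: it stands in for a server response whose course cards are only created by a script in the browser.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the server sends an empty container, and a script
# would fill it with course cards once a browser executes it.
INITIAL_HTML = """
<div id="catalog"></div>
<script>
  document.getElementById("catalog").innerHTML =
    '<h2 class="card_title__35G97">SQL</h2>';
</script>
"""

soup = BeautifulSoup(INITIAL_HTML, "html.parser")

# BeautifulSoup parses the markup but never runs the script, so the
# <h2> exists only as text inside <script>, not as an element.
found = soup.select('h2[class*="card_title"]')
print(len(found))
```

An empty result like this from a selector that works in the browser's inspector is a typical sign that the content is rendered client-side.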

You may find the following tutorial useful: WebScraping with BeautifulSoup and Selenium

Additionally, make sure the correct tag is being targeted. For example, in your screenshot the target is an anchor tag, so your find_all should be as follows:

name = soup.find_all('a', class_='card_container__25DrK')

However, do check the HTML retrieved by your program to make sure you are targeting the correct tag and specifying the correct attribute value.
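One way to do that check is to count which tag and class combinations actually occur in the HTML your program received. The helper below is a hypothetical sketch (the name summarize_classes and the sample markup are made up for illustration):

```python
from collections import Counter

from bs4 import BeautifulSoup


def summarize_classes(html, needle):
    """Count (tag name, class) pairs whose class contains `needle`."""
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    # class_=True selects every tag that has a class attribute at all.
    for tag in soup.find_all(class_=True):
        for cls in tag.get("class", []):
            if needle in cls:
                counts[(tag.name, cls)] += 1
    return counts


# Made-up HTML standing in for what the program retrieved.
sample = """
<a class="card_container__25DrK"><h2 class="card_title__35G97">SQL</h2></a>
<a class="card_container__25DrK"><h2 class="card_title__35G97">React</h2></a>
<div class="footer">About</div>
"""

counts = summarize_classes(sample, "card")
for (tag_name, cls), n in counts.items():
    print(tag_name, cls, n)
```

If the summary comes back empty for the retrieved HTML, the elements are probably not in the response at all, which points back to dynamic loading rather than a wrong selector.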
