Beautiful Soup: searching by class isn't working, returns only the table of contents header?

Time:11-05

So the idea is to scrape this particular page:

(screenshot: inspect element view)

However, with the following code:

import requests
import json
from bs4 import BeautifulSoup

url = "https://www.perlego.com/book/921329/getting-started-with-python-understand-key-data-structures-and-use-python-in-objectoriented-programming-pdf?queryID=9315f2c9285af80efdc99eaa9c5621bc&index=prod_BOOKS&gridPosition=2"

r = requests.get(url)

print(r.status_code)

soup = BeautifulSoup(r.content, 'html.parser')

# the same class with the next numeric suffix (sc-b81....-1) is the next link
print(soup.find_all(attrs={'class': 'sc-b81fc1ca-0'}))

what gets printed is:

[<div data-testid="table-of-contents"><h2>Table of contents</h2></div>]

whereas I would like all the tags under the class sc-b81fc1ca-2. I've tried searching for that class with find_all, but it only returns an empty list.

CodePudding user response:

The content you're looking for only loads after some JavaScript runs on the page. This tutorial should help you get that JavaScript to run before you do your scraping:

https://pythonprogramming.net/javascript-dynamic-scraping-parsing-beautiful-soup-tutorial/
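
One way to confirm this before reaching for a browser is to list the classes that actually appear in the HTML that requests received. This is a minimal sketch using only the standard library's html.parser. Here sample_html is a hypothetical stand-in for r.text (the class attribute on the div is assumed, since the printed output in the question does not show it), and ClassCollector and classes_in_html are illustrative names:

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in an HTML string."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in_html(html):
    """Return the set of class names found in the given markup."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


# Hypothetical stand-in for r.text: only the header div is in the static markup.
sample_html = (
    '<div class="sc-b81fc1ca-0" data-testid="table-of-contents">'
    "<h2>Table of contents</h2></div>"
)

found = classes_in_html(sample_html)
print("sc-b81fc1ca-0" in found)  # container class is in the static HTML
print("sc-b81fc1ca-2" in found)  # child class is not in the static HTML
```

If the class you want is missing from the set built from r.text, no parser or selector change will find it; the elements have to be rendered by a browser or fetched from wherever the page loads them.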

CodePudding user response:

The "Table of contents" tab is dynamic and clickable, and the content you want only appears after you click it. You can do that with a browser automation tool such as Selenium.

Example:

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

webdriver_service = Service("./chromedriver")  # your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)

driver.get('https://www.perlego.com/book/921329/getting-started-with-python-understand-key-data-structures-and-use-python-in-objectoriented-programming-pdf?queryID=9315f2c9285af80efdc99eaa9c5621bc&index=prod_BOOKS&gridPosition=2')
driver.maximize_window()
time.sleep(6)  # give the page time to load

# click the "Table of contents" tab so that its content gets rendered
driver.find_element(By.XPATH, "(//*[contains(text(),'Table of contents')])[1]").click()
time.sleep(1)

# parse the rendered page source, not the raw HTTP response
soup = BeautifulSoup(driver.page_source, "html.parser")

txt = soup.select_one('div.sc-b81fc1ca-2.kydYov:-soup-contains("Contributors")').text
print(txt)

Output:

Contributors

CodePudding user response:

The TOC data is loaded from an external URL via JavaScript. You can use the requests and json modules to load it directly:

import json
import requests

# the number at the end of the URL is the ID of the book:
ajax_url = "https://api.perlego.com/metadata/v2/metadata/books/toc/921329"
data = requests.get(ajax_url).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

def print_title(t, tabs=0):
    print("\t" * tabs, t["element_title"])
    for s in t.get("subchapters") or []:
        print_title(s, tabs + 1)


for t in data["data"]["book_toc"]:
    print_title(t)

Prints:

 
 Title Page
 Copyright and Credits
     Getting Started with Python
 About Packt
     Why subscribe?
     Packt.com
 Contributors
     About the authors
     Packt is searching for authors like you
         
 Preface
     Who this book is for
     What this book covers
     To get the most out of this book
         Download the example code files
         Conventions used
     Get in touch
         Reviews
 A Gentle Introduction to Python
     A proper introduction
     Enter the Python
     About Python
         Portability
         Coherence
         Developer productivity
         An extensive library
         Software quality
         Software integration
         Satisfaction and enjoyment
     What are the drawbacks?
     Who is using Python today?
     Setting up the environment
         Python 2 versus Python 3
     Installing Python
         Setting up the Python interpreter
         About virtualenv
         Your first virtual environment
         Your friend, the console
     How you can run a Python program
         Running Python scripts
         Running the Python interactive shell
         Running Python as a service
         Running Python as a GUI application
     How is Python code organized?
         How do we use modules and packages?
     Python's execution model
         Names and namespaces
         Scopes
         Objects and classes
     Guidelines on how to write good code
     The Python culture
     A note on IDEs
     Summary
 Built-in Data Types
     Everything is an object
     Mutable or immutable? That is the question
     Numbers
         Integers
         Booleans
         Real numbers
         Complex numbers
         Fractions and decimals
     Immutable sequences
         Strings and bytes
             Encoding and decoding strings
             Indexing and slicing strings
             String formatting
         Tuples
     Mutable sequences
         Lists
         Byte arrays
     Set types
     Mapping types – dictionaries
     The collections module
         namedtuple
         defaultdict
         ChainMap
     Enums
     Final considerations
         Small values caching
         How to choose data structures
         About indexing and slicing
         About the names
     Summary
 Iterating and Making Decisions
     Conditional programming
         A specialized else – elif
         The ternary operator
     Looping
         The for loop
             Iterating over a range
             Iterating over a sequence
         Iterators and iterables
         Iterating over multiple sequences
         The while loop
         The break and continue statements
         A special else clause
     Putting all this together
         A prime generator
         Applying discounts
     A quick peek at the itertools module
         Infinite iterators
         Iterators terminating on the shortest input sequence
         Combinatoric generators
     Summary
 Functions, the Building Blocks of Code
     Why use functions?
         Reducing code duplication
         Splitting a complex task
         Hiding implementation details
         Improving readability
         Improving traceability
     Scopes and name resolution
         The global and nonlocal statements
     Input parameters
         Argument passing
         Assignment to argument names doesn't affect the caller
         Changing a mutable affects the caller
         How to specify input parameters
             Positional arguments
             Keyword arguments and default values
             Variable positional arguments
             Variable keyword arguments
             Keyword-only arguments
             Combining input parameters
             Additional unpacking generalizations
             Avoid the trap! Mutable defaults
     Return values
         Returning multiple values
     A few useful tips
     Recursive functions
     Anonymous functions
     Function attributes
     Built-in functions
     One final example
     Documenting your code
     Importing objects
         Relative imports
     Summary
 Files and Data Persistence
     Working with files and directories
         Opening files
             Using a context manager to open a file
         Reading and writing to a file
             Reading and writing in binary mode
             Protecting against overriding an existing file
         Checking for file and directory existence
         Manipulating files and directories
             Manipulating pathnames
         Temporary files and directories
         Directory content
         File and directory compression
     Data interchange formats
         Working with JSON
             Custom encoding/decoding with JSON
     IO, streams, and requests
         Using an in-memory stream
         Making HTTP requests
     Persisting data on disk
         Serializing data with pickle
         Saving data with shelve
         Saving data to a database
     Summary
 Principles of Algorithm Design
     Algorithm design paradigms
     Recursion and backtracking
         Backtracking
         Divide and conquer - long multiplication
         Can we do better? A recursive approach
     Runtime analysis
         Asymptotic analysis
         Big O notation
             Composing complexity classes
             Omega notation (Ω)
             Theta notation (ϴ)
     Amortized analysis
     Summary
 Lists and Pointer Structures
     Arrays
     Pointer structures
     Nodes
     Finding endpoints
         Node
             Other node types
     Singly linked lists
         Singly linked list class
         Append operation
     A faster append operation
     Getting the size of the list
     Improving list traversal
     Deleting nodes
         List search
     Clearing a list
     Doubly linked lists
         A doubly linked list node
             Doubly linked list
         Append operation
         Delete operation
         List search
     Circular lists
         Appending elements
         Deleting an element
             Iterating through a circular list
     Summary
 Stacks and Queues
     Stacks
         Stack implementation
         Push operation
         Pop operation
             Peek
         Bracket-matching application
     Queues
         List-based queue
             Enqueue operation
             Dequeue operation
         Stack-based queue
             Enqueue operation
             Dequeue operation
         Node-based queue
             Queue class
             Enqueue operation
             Dequeue operation
         Application of queues
             Media player queue
     Summary
 Trees
     Terminology
     Tree nodes
     Binary trees
         Binary search trees
         Binary search tree implementation
         Binary search tree operations
             Finding the minimum and maximum nodes
         Inserting nodes
         Deleting nodes
         Searching the tree
         Tree traversal
             Depth-first traversal
                 In-order traversal and infix notation
                 Pre-order traversal and prefix notation
                 Post-order traversal and postfix notation.
             Breadth-first traversal
         Benefits of a binary search tree
         Expression trees
             Parsing a reverse Polish expression
         Balancing trees
         Heaps
     Summary
 Hashing and Symbol Tables
     Hashing
         Perfect hashing functions
     Hash table
         Putting elements
         Getting elements
         Testing the hash table
         Using [] with the hash table
         Non-string keys
         Growing a hash table
         Open addressing
             Chaining
         Symbol tables
     Summary
 Graphs and Other Algorithms
     Graphs
     Directed and undirected graphs
     Weighted graphs
     Graph representation
         Adjacency list
         Adjacency matrix
     Graph traversal
         Breadth-first search
         Depth-first search
     Other useful graph methods
     Priority queues and heaps
         Inserting
         Pop
         Testing the heap
     Selection algorithms
     Summary
 Searching
     Linear Search
         Unordered linear search
         Ordered linear search
     Binary search
     Interpolation search
         Choosing a search algorithm
     Summary
 Sorting
     Sorting algorithms
     Bubble sort
     Insertion sort
     Selection sort
     Quick sort
         List partitioning
             Pivot selection
         Implementation
         Heap sort
     Summary
 Selection Algorithms
     Selection by sorting
     Randomized selection
         Quick select
             Partition step
     Deterministic selection
         Pivot selection
         Median of medians
         Partitioning step
     Summary
 Object-Oriented Design
     Introducing object-oriented
     Objects and classes
     Specifying attributes and behaviors
         Data describes objects
         Behaviors are actions
     Hiding details and creating the public interface
     Composition
     Inheritance
         Inheritance provides abstraction
         Multiple inheritance
     Case study
     Exercises
     Summary
 Objects in Python
     Creating Python classes
         Adding attributes
         Making it do something
             Talking to yourself
             More arguments
         Initializing the object
         Explaining yourself
     Modules and packages
         Organizing modules
             Absolute imports
             Relative imports
     Organizing module content
     Who can access my data?
     Third-party libraries
     Case study
     Exercises
     Summary
 When Objects Are Alike
     Basic inheritance
         Extending built-ins
         Overriding and super
     Multiple inheritance
         The diamond problem
         Different sets of arguments
     Polymorphism
     Abstract base classes
         Using an abstract base class
         Creating an abstract base class
         Demystifying the magic
     Case study
     Exercises
     Summary
 Expecting the Unexpected
     Raising exceptions
         Raising an exception
         The effects of an exception
         Handling exceptions
         The exception hierarchy
         Defining our own exceptions
     Case study
     Exercises
     Summary
 When to Use Object-Oriented Programming
     Treat objects as objects
     Adding behaviors to class data with properties
         Properties in detail
         Decorators – another way to create properties
         Deciding when to use properties
     Manager objects
         Removing duplicate code
         In practice
     Case study
     Exercises
     Summary
 Python Object-Oriented Shortcuts
     Python built-in functions
         The len() function
         Reversed
         Enumerate
         File I/O
         Placing it in context
     An alternative to method overloading
         Default arguments
         Variable argument lists
         Unpacking arguments
     Functions are objects too
         Using functions as attributes
         Callable objects
     Case study
     Exercises
     Summary
 The Iterator Pattern
     Design patterns in brief
     Iterators
         The iterator protocol
     Comprehensions
         List comprehensions
         Set and dictionary comprehensions
         Generator expressions
     Generators
         Yield items from another iterable
     Coroutines
         Back to log parsing
         Closing coroutines and throwing exceptions
         The relationship between coroutines, generators, and functions
     Case study
     Exercises
     Summary
 Python Design Patterns I
     The decorator pattern
         A decorator example
         Decorators in Python
     The observer pattern
         An observer example
     The strategy pattern
         A strategy example
         Strategy in Python
     The state pattern
         A state example
         State versus strategy
         State transition as coroutines
     The singleton pattern
         Singleton implementation
         Module variables can mimic singletons
     The template pattern
         A template example
     Exercises
     Summary
 Python Design Patterns II
     The adapter pattern
     The facade pattern
     The flyweight pattern
     The command pattern
     The abstract factory pattern
     The composite pattern
     Exercises
     Summary
 Testing Object-Oriented Programs
     Why test?
         Test-driven development
     Unit testing
         Assertion methods
         Reducing boilerplate and cleaning up
         Organizing and running tests
         Ignoring broken tests
     Testing with pytest
         One way to do setup and cleanup
         A completely different way to set up variables
         Skipping tests with pytest
     Imitating expensive objects
     How much testing is enough?
     Case study
         Implementing it
     Exercises
     Summary
 Other Books You May Enjoy
     Leave a review - let other readers know what you think
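
If you want the titles as data rather than printed text, the same recursion can yield (depth, title) pairs instead. This is a minimal sketch that assumes the "element_title" / "subchapters" structure used above. Here sample_toc is a small hand-written stand-in for data["data"]["book_toc"], not real API output, and flatten_toc is an illustrative name:

```python
def flatten_toc(entries, depth=0):
    """Yield (depth, title) pairs from nested TOC entries, depth first."""
    for entry in entries:
        yield depth, entry["element_title"]
        # "subchapters" may be missing, None, or empty, hence the `or []`
        yield from flatten_toc(entry.get("subchapters") or [], depth + 1)


# Hand-written sample shaped like data["data"]["book_toc"]; not real API output.
sample_toc = [
    {"element_title": "Preface", "subchapters": [
        {"element_title": "Who this book is for", "subchapters": None},
        {"element_title": "Get in touch", "subchapters": [
            {"element_title": "Reviews"},
        ]},
    ]},
    {"element_title": "Trees", "subchapters": []},
]

rows = list(flatten_toc(sample_toc))
for depth, title in rows:
    print("    " * depth + title)
```

The resulting list of pairs is easy to filter (for example, keep only depth 0 for chapter titles) or to write out as CSV.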