The idea is to scrape this particular page.
However, with the following code:
import requests
import json
from bs4 import BeautifulSoup
url = "https://www.perlego.com/book/921329/getting-started-with-python-understand-key-data-structures-and-use-python-in-objectoriented-programming-pdf?queryID=9315f2c9285af80efdc99eaa9c5621bc&index=prod_BOOKS&gridPosition=2"
r = requests.get(url)
print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')
#another extra number on the side of sc-b81....-1 is the next link
print(soup.find_all(attrs={'class': 'sc-b81fc1ca-0'}))
What this prints is:
[<div data-testid="table-of-contents"><h2 >Table of contents</h2></div>]
whereas I would like all the tags under the class sc-b81fc1ca-2. I've tried searching for that class with find_all, but it only returns an empty list.
CodePudding user response:
The content you're looking for only loads after some JavaScript runs on the page. This tutorial should help you get that JavaScript to run before performing your scraping:
https://pythonprogramming.net/javascript-dynamic-scraping-parsing-beautiful-soup-tutorial/
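Before reaching for a JavaScript-capable tool, it can help to confirm that the class really is missing from the HTML that requests receives. The following is only a minimal sketch using the standard library: ClassCollector, has_class and the static_html sample are hypothetical names and data made up for illustration, with the sample standing in for r.text from the question.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def has_class(html, class_name):
    """Return True if class_name occurs on any tag in the given HTML text."""
    collector = ClassCollector()
    collector.feed(html)
    return class_name in collector.classes


# Hypothetical sample standing in for r.text from requests.get(url)
static_html = (
    '<div class="sc-b81fc1ca-0" data-testid="table-of-contents">'
    "<h2>Table of contents</h2></div>"
)
print(has_class(static_html, "sc-b81fc1ca-0"))  # True
print(has_class(static_html, "sc-b81fc1ca-2"))  # False
```

If the class is absent from the static HTML but visible in the browser's inspector, the element is most likely being added by JavaScript after the page loads.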
CodePudding user response:
The "Table of contents" tab is dynamic and clickable, and the desired content only appears after you click on it. You can do that with a browser automation tool such as Selenium.
Example:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
driver.get('https://www.perlego.com/book/921329/getting-started-with-python-understand-key-data-structures-and-use-python-in-objectoriented-programming-pdf?queryID=9315f2c9285af80efdc99eaa9c5621bc&index=prod_BOOKS&gridPosition=2')
driver.maximize_window()
time.sleep(6)  # give the page time to load
# click the "Table of contents" tab so its content gets rendered
driver.find_element(By.XPATH, "(//*[contains(text(),'Table of contents')])[1]").click()
time.sleep(1)  # give the tab content time to appear
# parse the rendered page, which now includes the clicked tab's content
soup = BeautifulSoup(driver.page_source,"html.parser")
txt = soup.select_one('div.sc-b81fc1ca-2.kydYov:-soup-contains("Contributors")').text
print(txt)
Output:
Contributors
CodePudding user response:
The data about the TOC is loaded from an external URL via JavaScript. You can use the requests/json modules to load it:
import json
import requests

# the number at the end of the URL is the ID of the book:
ajax_url = "https://api.perlego.com/metadata/v2/metadata/books/toc/921329"
data = requests.get(ajax_url).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

def print_title(t, tabs=0):
    # print the title, indented by one tab per nesting level
    print("\t" * tabs, t["element_title"])
    # recurse into subchapters (the key may be missing or None)
    for s in t.get("subchapters") or []:
        print_title(s, tabs + 1)

for t in data["data"]["book_toc"]:
    print_title(t)
Prints:
Title Page
Copyright and Credits
Getting Started with Python
About Packt
Why subscribe?
Packt.com
Contributors
About the authors
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Reviews
A Gentle Introduction to Python
A proper introduction
Enter the Python
About Python
Portability
Coherence
Developer productivity
An extensive library
Software quality
Software integration
Satisfaction and enjoyment
What are the drawbacks?
Who is using Python today?
Setting up the environment
Python 2 versus Python 3
Installing Python
Setting up the Python interpreter
About virtualenv
Your first virtual environment
Your friend, the console
How you can run a Python program
Running Python scripts
Running the Python interactive shell
Running Python as a service
Running Python as a GUI application
How is Python code organized?
How do we use modules and packages?
Python's execution model
Names and namespaces
Scopes
Objects and classes
Guidelines on how to write good code
The Python culture
A note on IDEs
Summary
Built-in Data Types
Everything is an object
Mutable or immutable? That is the question
Numbers
Integers
Booleans
Real numbers
Complex numbers
Fractions and decimals
Immutable sequences
Strings and bytes
Encoding and decoding strings
Indexing and slicing strings
String formatting
Tuples
Mutable sequences
Lists
Byte arrays
Set types
Mapping types – dictionaries
The collections module
namedtuple
defaultdict
ChainMap
Enums
Final considerations
Small values caching
How to choose data structures
About indexing and slicing
About the names
Summary
Iterating and Making Decisions
Conditional programming
A specialized else – elif
The ternary operator
Looping
The for loop
Iterating over a range
Iterating over a sequence
Iterators and iterables
Iterating over multiple sequences
The while loop
The break and continue statements
A special else clause
Putting all this together
A prime generator
Applying discounts
A quick peek at the itertools module
Infinite iterators
Iterators terminating on the shortest input sequence
Combinatoric generators
Summary
Functions, the Building Blocks of Code
Why use functions?
Reducing code duplication
Splitting a complex task
Hiding implementation details
Improving readability
Improving traceability
Scopes and name resolution
The global and nonlocal statements
Input parameters
Argument passing
Assignment to argument names doesn't affect the caller
Changing a mutable affects the caller
How to specify input parameters
Positional arguments
Keyword arguments and default values
Variable positional arguments
Variable keyword arguments
Keyword-only arguments
Combining input parameters
Additional unpacking generalizations
Avoid the trap! Mutable defaults
Return values
Returning multiple values
A few useful tips
Recursive functions
Anonymous functions
Function attributes
Built-in functions
One final example
Documenting your code
Importing objects
Relative imports
Summary
Files and Data Persistence
Working with files and directories
Opening files
Using a context manager to open a file
Reading and writing to a file
Reading and writing in binary mode
Protecting against overriding an existing file
Checking for file and directory existence
Manipulating files and directories
Manipulating pathnames
Temporary files and directories
Directory content
File and directory compression
Data interchange formats
Working with JSON
Custom encoding/decoding with JSON
IO, streams, and requests
Using an in-memory stream
Making HTTP requests
Persisting data on disk
Serializing data with pickle
Saving data with shelve
Saving data to a database
Summary
Principles of Algorithm Design
Algorithm design paradigms
Recursion and backtracking
Backtracking
Divide and conquer - long multiplication
Can we do better? A recursive approach
Runtime analysis
Asymptotic analysis
Big O notation
Composing complexity classes
Omega notation (Ω)
Theta notation (ϴ)
Amortized analysis
Summary
Lists and Pointer Structures
Arrays
Pointer structures
Nodes
Finding endpoints
Node
Other node types
Singly linked lists
Singly linked list class
Append operation
A faster append operation
Getting the size of the list
Improving list traversal
Deleting nodes
List search
Clearing a list
Doubly linked lists
A doubly linked list node
Doubly linked list
Append operation
Delete operation
List search
Circular lists
Appending elements
Deleting an element
Iterating through a circular list
Summary
Stacks and Queues
Stacks
Stack implementation
Push operation
Pop operation
Peek
Bracket-matching application
Queues
List-based queue
Enqueue operation
Dequeue operation
Stack-based queue
Enqueue operation
Dequeue operation
Node-based queue
Queue class
Enqueue operation
Dequeue operation
Application of queues
Media player queue
Summary
Trees
Terminology
Tree nodes
Binary trees
Binary search trees
Binary search tree implementation
Binary search tree operations
Finding the minimum and maximum nodes
Inserting nodes
Deleting nodes
Searching the tree
Tree traversal
Depth-first traversal
In-order traversal and infix notation
Pre-order traversal and prefix notation
Post-order traversal and postfix notation.
Breadth-first traversal
Benefits of a binary search tree
Expression trees
Parsing a reverse Polish expression
Balancing trees
Heaps
Summary
Hashing and Symbol Tables
Hashing
Perfect hashing functions
Hash table
Putting elements
Getting elements
Testing the hash table
Using [] with the hash table
Non-string keys
Growing a hash table
Open addressing
Chaining
Symbol tables
Summary
Graphs and Other Algorithms
Graphs
Directed and undirected graphs
Weighted graphs
Graph representation
Adjacency list
Adjacency matrix
Graph traversal
Breadth-first search
Depth-first search
Other useful graph methods
Priority queues and heaps
Inserting
Pop
Testing the heap
Selection algorithms
Summary
Searching
Linear Search
Unordered linear search
Ordered linear search
Binary search
Interpolation search
Choosing a search algorithm
Summary
Sorting
Sorting algorithms
Bubble sort
Insertion sort
Selection sort
Quick sort
List partitioning
Pivot selection
Implementation
Heap sort
Summary
Selection Algorithms
Selection by sorting
Randomized selection
Quick select
Partition step
Deterministic selection
Pivot selection
Median of medians
Partitioning step
Summary
Object-Oriented Design
Introducing object-oriented
Objects and classes
Specifying attributes and behaviors
Data describes objects
Behaviors are actions
Hiding details and creating the public interface
Composition
Inheritance
Inheritance provides abstraction
Multiple inheritance
Case study
Exercises
Summary
Objects in Python
Creating Python classes
Adding attributes
Making it do something
Talking to yourself
More arguments
Initializing the object
Explaining yourself
Modules and packages
Organizing modules
Absolute imports
Relative imports
Organizing module content
Who can access my data?
Third-party libraries
Case study
Exercises
Summary
When Objects Are Alike
Basic inheritance
Extending built-ins
Overriding and super
Multiple inheritance
The diamond problem
Different sets of arguments
Polymorphism
Abstract base classes
Using an abstract base class
Creating an abstract base class
Demystifying the magic
Case study
Exercises
Summary
Expecting the Unexpected
Raising exceptions
Raising an exception
The effects of an exception
Handling exceptions
The exception hierarchy
Defining our own exceptions
Case study
Exercises
Summary
When to Use Object-Oriented Programming
Treat objects as objects
Adding behaviors to class data with properties
Properties in detail
Decorators – another way to create properties
Deciding when to use properties
Manager objects
Removing duplicate code
In practice
Case study
Exercises
Summary
Python Object-Oriented Shortcuts
Python built-in functions
The len() function
Reversed
Enumerate
File I/O
Placing it in context
An alternative to method overloading
Default arguments
Variable argument lists
Unpacking arguments
Functions are objects too
Using functions as attributes
Callable objects
Case study
Exercises
Summary
The Iterator Pattern
Design patterns in brief
Iterators
The iterator protocol
Comprehensions
List comprehensions
Set and dictionary comprehensions
Generator expressions
Generators
Yield items from another iterable
Coroutines
Back to log parsing
Closing coroutines and throwing exceptions
The relationship between coroutines, generators, and functions
Case study
Exercises
Summary
Python Design Patterns I
The decorator pattern
A decorator example
Decorators in Python
The observer pattern
An observer example
The strategy pattern
A strategy example
Strategy in Python
The state pattern
A state example
State versus strategy
State transition as coroutines
The singleton pattern
Singleton implementation
Module variables can mimic singletons
The template pattern
A template example
Exercises
Summary
Python Design Patterns II
The adapter pattern
The facade pattern
The flyweight pattern
The command pattern
The abstract factory pattern
The composite pattern
Exercises
Summary
Testing Object-Oriented Programs
Why test?
Test-driven development
Unit testing
Assertion methods
Reducing boilerplate and cleaning up
Organizing and running tests
Ignoring broken tests
Testing with pytest
One way to do setup and cleanup
A completely different way to set up variables
Skipping tests with pytest
Imitating expensive objects
How much testing is enough?
Case study
Implementing it
Exercises
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
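As a possible follow-up, the book ID could be pulled out of the page URL instead of being typed by hand, and the titles could be collected into a list rather than printed. This is only a sketch under assumptions: book_id_from_url and flatten_toc are hypothetical helpers, the URL pattern /book/<digits>/ is inferred from the single URL in the question, and sample_toc is made-up data that mimics the element_title and subchapters keys used in the answer's code, not a real API response.

```python
import re


def book_id_from_url(url):
    """Return the digits following '/book/' in a book page URL, or None."""
    match = re.search(r"/book/(\d+)", url)
    return match.group(1) if match else None


def flatten_toc(entries, depth=0):
    """Yield (depth, title) pairs for every entry and nested subchapter."""
    for entry in entries:
        yield depth, entry["element_title"]
        # "subchapters" may be missing or None, so fall back to an empty list
        yield from flatten_toc(entry.get("subchapters") or [], depth + 1)


page_url = (
    "https://www.perlego.com/book/921329/getting-started-with-python-"
    "understand-key-data-structures-and-use-python-in-objectoriented-programming-pdf"
)
book_id = book_id_from_url(page_url)
# Same endpoint as in the answer, with the ID filled in
ajax_url = f"https://api.perlego.com/metadata/v2/metadata/books/toc/{book_id}"
print(ajax_url)

# Hypothetical sample shaped like data["data"]["book_toc"]
sample_toc = [
    {
        "element_title": "Preface",
        "subchapters": [
            {"element_title": "Who this book is for", "subchapters": None},
        ],
    },
    {"element_title": "Summary"},
]
rows = list(flatten_toc(sample_toc))
print(rows)
# [(0, 'Preface'), (1, 'Who this book is for'), (0, 'Summary')]
```

With the real response, list(flatten_toc(data["data"]["book_toc"])) should give the same titles as the printed output, each paired with its nesting depth.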