How can I scrape a website if the attributes are randomized?

I am trying to scrape this website:

CodePudding user response:

If you inspect the page, you'll see that the text of the book sits inside the element with the classes content-book my-4, so that is the element to target.

However, you can't simply use:

 soup.find_all(class_="content-book my-4")

since that returns the whole container, including unnecessary <script> tags:

<div ><p> <strong>Chapter 2 Sick Feeling</strong></p><p> Scarlett’s POV:</p><p> “Anything else?” I asked in disbelief.</p><p> “We have to get up early to see Rita tomorrow,” Charles replied coldly.</p><p> “Okay.”</p><p> I was confused. I could not help but wonder if he returned just to make a point.</p><p> “I’ll sleep here tonight,” he added.</p><p> I came to my senses the instant I heard what he had said. I wanted to ask him if it was really okay for

So, instead, use a CSS selector:

for element in soup.select(".content-book.my-4 p"):
    print(element.text)

This selects every <p> inside the element that has both classes, content-book and my-4. (The example is for Chapter 2, but the same selector also works for Chapter 1.)

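import requests
from bs4 import BeautifulSoup


URL = "https://novel5s.com/bye-my-irresistible-love-by-goreous-novel5-online-2138/148982.html"

# Download the chapter page and parse it with Python's built-in HTML parser.
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

# Print the text of every <p> inside the .content-book.my-4 container.
for element in soup.select(".content-book.my-4 p"):
    print(element.text)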

Output:

 Chapter 2 Sick Feeling
 Scarlett’s POV:
 “Anything else?” I asked in disbelief.
 “We have to get up early to see Rita tomorrow,” Charles replied coldly.
 “Okay.”
 I was confused. I could not help but wonder if he returned just to make a point.
 “I’ll sleep here tonight,” he added.
 I came to my senses the instant I heard what he had said. I wanted to ask him if it was really okay for him to stay here, but I decided to swallow my words instead.
 “I’m afraid you’ll oversleep because of the jet lag,” he 
...
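
To see the difference between the two lookups without hitting the site, here is a minimal offline sketch. The SAMPLE_HTML string is a made-up stand-in for the chapter markup (an assumption, not the real page), so treat it as an illustration only:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the chapter markup: a container that holds
# both paragraphs and a <script> tag. Not copied from the real page.
SAMPLE_HTML = """
<div class="content-book my-4">
  <p>Chapter 2 Sick Feeling</p>
  <script>console.log("ad");</script>
  <p>Scarlett's POV:</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns the whole container, so the <script> tag comes with it.
containers = soup.find_all(class_="content-book my-4")
has_script = containers[0].find("script") is not None

# The CSS selector returns only the <p> descendants of the container.
paragraphs = [p.get_text(strip=True) for p in soup.select(".content-book.my-4 p")]

print(has_script)
print(paragraphs)
```

The selector keeps the filtering in one place, which is why it is used in the full script above instead of post-processing the find_all result.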

CodePudding user response:

The order of the hidden text seems to be encoded in the style element in the page's HTML, just below the div element that contains all the paragraphs (see screenshot).

The rules in this style element seem to correspond to the classes and randomized tags in the paragraph elements that you are having trouble parsing.

My suggestion would be to parse this style element, extract the classes and tags in the right order, and then read those tags from the paragraph elements in that order to reconstruct the complete paragraphs.

It would still require some parsing and decoding, but I hope this helps!
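
As a rough sketch of that idea: the thread does not show the actual format of the style element, so everything in SAMPLE_HTML below is an assumption. It pretends that each randomized class gets an `order` rule in the style element and that the paragraph pieces must be sorted by that value. The tag names, class names, and the `rebuild` helper are all hypothetical, and the real page will likely need a different regular expression:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup. The real <style> format is not shown in the thread,
# so the `order` rules, tag names, and class names here are assumptions.
SAMPLE_HTML = """
<div class="content-book my-4">
  <p><xq class="k3">world</xq><zt class="a9">Hello </zt></p>
</div>
<style>
.a9 { order: 1; }
.k3 { order: 2; }
</style>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Map each class name to the position declared for it in the <style> element.
css = soup.find("style").string or ""
order = {
    name: int(position)
    for name, position in re.findall(r"\.([\w-]+)\s*\{\s*order\s*:\s*(\d+)", css)
}


def rebuild(paragraph):
    """Join the direct child tags of a paragraph in their declared order."""
    pieces = paragraph.find_all(True, recursive=False)
    ordered = sorted(pieces, key=lambda tag: order.get(tag.get("class", [""])[0], 0))
    return "".join(tag.get_text() for tag in ordered)


paragraphs = [rebuild(p) for p in soup.select(".content-book.my-4 p")]
print(paragraphs)
```

If the real style element encodes the order differently, only the regular expression and the sort key should need to change; the overall flow (read the style rules, build a class-to-position map, reassemble each paragraph) stays the same.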

Screenshot: The element that presumably encodes the text order contained in randomized tags