Speeding up potentially large chained BeautifulSoup tasks

Time:08-13

I'm very new to web scraping (I know next to nothing about HTML and this is my first time using BeautifulSoup) and I'm making a program that essentially lets me generate PDFs or EPUBs for novels online. I'm not worried about compatibility with a wide variety of sites, since I'm just making this for me. I wrote the code that gets the links for all the chapters of a web novel, starting from the link to any one chapter, and puts them all into a list. However, this takes a long time: somewhere around a second for each link. Given that some novels are upwards of 1-2 thousand chapters, that's around half an hour just to get all the links, and the program hasn't even fetched the body text of each link and compiled it into PDFs yet. Is there a way I can make this code faster?

import requests
from bs4 import BeautifulSoup
def list_chapters():
    given_chapter = 'https://www.box-novel.com/novel/cannon-fodder-counterattack-system/chapter-4-1/'
    current_chapter = find_first_chapter(given_chapter)
    print("Starting chapter: ", current_chapter)
    link_list = []
    try:
        while True:
            link_list.append(current_chapter)
            r = requests.get(current_chapter)
            soup = BeautifulSoup(r.content, 'html.parser')
            s = soup.find('div', class_='nav-next')
            for link in s.find_all('a'):
                current_chapter = link.get('href')
    except AttributeError:
        link_list.pop(-1)
        print(len(link_list), "chapters detected.")

Please let me know other ways to improve my code as well. Note: I pop the last value in the list because it's easier than detecting when the nav-next value points to manga-info, which is what nav-next references on the last chapter. Also, ignore the random trash novel link I used; it's the shortest one I could find on the first page.

CodePudding user response:

If one request at a time is too slow, we should fire several of them at the same time!

How? Well, there are multiple options, but I'd stick to the aiohttp library, which does what requests does, but asynchronously.

Here's an example of using it, which I totally stole from another question:

import asyncio
import aiohttp
import time

websites = """https://www.youtube.com
http://www.chrome.com
http://www.booking.com
http://www.googleusercontent.com
http://www.google.com.au
http://www.popads.net
http://www.cntv.cn"""


async def get(url, session):
    try:
        async with session.get(url=url) as response:
            resp = await response.read()
            print("Successfully got url {} with resp of length {}.".format(url, len(resp)))
    except Exception as e:
        print("Unable to get url {} due to {}.".format(url, e.__class__))


async def main(urls):
    async with aiohttp.ClientSession() as session:
        ret = await asyncio.gather(*[get(url, session) for url in urls])
    print("Finalized all. Return is a list of len {} outputs.".format(len(ret)))


urls = websites.split("\n")
start = time.time()
asyncio.run(main(urls))
end = time.time()

print("Took {} seconds to pull {} websites.".format(end - start, len(urls)))
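
The example above starts every request at once, which may be too aggressive once the list grows to a thousand or more chapter URLs. Below is a minimal sketch of capping the number of requests in flight with asyncio.Semaphore. The names fetch_all and fake_fetch and the limit values are illustrative assumptions, not part of the code above; fake_fetch stands in for the real aiohttp session.get call so that the sketch runs without network access.

```python
import asyncio


async def fetch_all(urls, fetch, limit=10):
    """Run fetch(url) for every URL with at most `limit` calls in flight.

    `fetch` is any coroutine function that takes a URL; in real use it
    would wrap the aiohttp request. Results are returned in input order.
    """
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        # Wait for a free slot before starting the request.
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(bounded(url) for url in urls))


async def fake_fetch(url):
    # Hypothetical stand-in for a network call.
    await asyncio.sleep(0.01)
    return "body of " + url


if __name__ == "__main__":
    demo_urls = ["https://example.com/chapter-{}".format(i) for i in range(1, 6)]
    pages = asyncio.run(fetch_all(demo_urls, fake_fetch, limit=2))
    print(len(pages), "pages fetched")
```

Because asyncio.gather returns results in the order the awaitables were passed in, the pages line up with the input URLs even though the requests finish in arbitrary order.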

CodePudding user response:

Your task is non-trivial. First, the links to all chapters are loaded via an AJAX POST request on that entry-point page. After you sort that out, you need a robust async solution, and I mean something that can handle a list of a billion links and can be run on a Raspberry Pi (so you need some concept of a queue). The following takes approximately 10 seconds and returns a dataframe with the title and content of each of the 90 chapters of the novel (which you can then sort by title, if you want):

import asyncio
from httpx import Client, AsyncClient, Limits
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime


pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

## run this if you're executing the code in a notebook
import nest_asyncio
nest_asyncio.apply()

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
#### setup some sort of mock persistence ###
big_df_list = []

#### async scrape funcs ####
def all_chapters_urls():
    url_list = []
    payload = {
        'action': 'manga_get_reading_nav',
        'manga': '1987979',
        'chapter': 'chapter-29-7',
        'volume_id': '0',
        'type': 'content'
              }
    with Client(headers=headers, timeout=60.0, follow_redirects=True) as client:
        r = client.post('https://www.box-novel.com/wp-admin/admin-ajax.php', data = payload)
        soup = BeautifulSoup(r.text, 'html.parser')
        links = soup.select_one('select.c-selectpicker.selectpicker_chapter.selectpicker.single-chapter-select').select('option')
        for l in links:
            url_list.append(l.get('data-redirect'))
    return url_list
            
async def get_chapters(url):
    async with AsyncClient(headers=headers, timeout=60.0, follow_redirects=True) as client:
        try:
            r = await client.get(url)
            soup = BeautifulSoup(r.text, 'html.parser')
            title = soup.select_one('h1#chapter-heading').get_text(strip=True)
            text_content = soup.select_one('div.text-left').get_text(strip=True)
            big_df_list.append((title, text_content))
        except Exception as e:
            print(url, e)

async def scrape_chapters():
    start_time = datetime.now()
    tasks = asyncio.Queue()
    for x in all_chapters_urls():
        tasks.put_nowait(get_chapters(x))

    async def worker():
        while not tasks.empty():
            await tasks.get_nowait()
            
    await asyncio.gather(*[worker() for _ in range(20)])
    end_time = datetime.now()
    duration = end_time - start_time
    print('chapters scraping took', duration)

asyncio.run(scrape_chapters())
df = pd.DataFrame(big_df_list, columns = ['Chapter', 'Content'])
print(df)

This will print to the terminal:

chapters scraping took 0:00:10.991827
    Chapter                                             Content
0   Cannon Fodder Counterattack System - Chapter 30.1   The power of gossip was never been underestimated. Huang Dezheng’s reputation for kind and charismatic was far-reaching. ...
1   Cannon Fodder Counterattack System - Chapter 29.7   Qin Shiyue rushed back to the house without saying a word, he was tempted to blow up, but he was afraid of hurting the stupid rabbit, so he kept suppressing it. ...
2   Cannon Fodder Counterattack System - Chapter 29.8   With his long leg stretched, ...
[...]
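
Since the requests finish in arbitrary order, the rows come back unsorted (30.1 before 29.7 above), and a plain string sort would put "Chapter 30.1" before "Chapter 4.1". Here is a minimal sketch of a numeric sort key for such titles. The helper name chapter_key is hypothetical, and it assumes titles contain "Chapter X.Y" where Y is a part number rather than a decimal fraction, so that 29.10 comes after 29.9.

```python
import re


def chapter_key(title):
    """Turn '... Chapter 29.7' into (29, 7) so titles sort numerically."""
    match = re.search(r"Chapter\s+(\d+(?:\.\d+)*)", title)
    if match is None:
        # Titles without a chapter number sort after everything else.
        return (float("inf"),)
    return tuple(int(part) for part in match.group(1).split("."))


if __name__ == "__main__":
    rows = [
        ("Cannon Fodder Counterattack System - Chapter 30.1", "..."),
        ("Cannon Fodder Counterattack System - Chapter 29.7", "..."),
        ("Cannon Fodder Counterattack System - Chapter 4.1", "..."),
    ]
    # Sort (title, content) tuples by the chapter number in the title.
    rows.sort(key=lambda row: chapter_key(row[0]))
    for title, _ in rows:
        print(title)
```

In the script above, the same key could be applied to big_df_list before the DataFrame is built, for example big_df_list.sort(key=lambda row: chapter_key(row[0])).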