How can I add multiprocessing to this web scraping code? Should I use multithreading instead?

Time:05-04

What I'm trying to achieve is to shorten the time needed to complete the scraping process and store all the data in a dictionary (the dictionary is Untiters: keys are usernames, values are the number of times that user made a post with one of the "untitled" names). I used this site as a tutorial, but I couldn't figure out how to apply what's explained there to my code. Here is the code; sorry if I've included an unnecessarily large portion of it.

from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup
z = 0
Untitleds = ["Sin título","Untitled","Sans titre","İsimsiz","Ohne Titel","بلا عنوان",
             "Без названия","无标题","夕イトルなし"]
Untiters = {}
Untits = []

x = 138
for i in range(1,20):
    y = x + 1
    x = y
    Id = y
    link = "https://folioscope.co/blank/" + str(Id)
    Url = (link)
    R = requests.get(Url)
    Soup = BeautifulSoup(R.text,"html5lib")
    Pretitle = (Soup.find("div",{"class":"container_padding"}))
    Title = Pretitle.div.text
    if Title in (Untitleds):
        Prename = Soup.find("div",{"class":"padding_bottom_normal"})
        Name = Prename.a.text
        Untitled = z + 1
        z = Untitled

        if Name not in Untiters:
            Untiters.update({Name : 1})
        else:
            c0 = Untiters[Name]
            c1 = c0 + 1
            Untiters[Name] = c1
        Untits.append(Title)
        print (Title, Name)

Answer:

To fetch the pages with multiprocessing.Pool, you can use the following example:

from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup


def get_data(id_):
    url = "https://folioscope.co/blank/" + str(id_)
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    title = soup.select_one("#animation_container .title") or ""
    if title:
        title = title.text

    username = soup.select_one(".username") or ""
    if username:
        username = username.text

    return id_, title, username


if __name__ == "__main__":
    with Pool() as pool:
        for id_, title, username in pool.imap_unordered(
            get_data, range(138, 158)
        ):
            if title and username:
                print("{:<4} {:<40} {}".format(id_, title, username))

                # here you can add the result to list, filter duplicates etc.

Prints:

153  First attempt                            CyberAly
149  Minecraft Loop                           MisterD
142  An Idea!                                 Pyro
148  Untitled                                 szymun
152  Thunder                                  dpknyk1993
139  Untitled                                 WoopDeDoo
146  Untitled                                 szymun
144  Loop                                     pjrd
138  Blink                                    fairyfina
140  Test                                     sknob
154  Dragon Ball kameha                       piedicmolkok
157  Boom                                     animation33
156  Tree in wind                             CyberAly
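
The question also asks for a dictionary that maps each username to the number of untitled posts. One possible way to build it from the (id_, title, username) tuples is collections.Counter. This is only a sketch: count_untitled and its untitled_titles parameter are hypothetical names, and the sample rows are copied from the output above rather than fetched live.

```python
from collections import Counter


def count_untitled(results, untitled_titles):
    """Count, per username, the posts whose title is in untitled_titles.

    results: iterable of (id_, title, username) tuples, in the shape
    returned by get_data in the answer.
    """
    counts = Counter()
    for _id, title, username in results:
        # Skip pages where no username was found or the title is a real one.
        if username and title in untitled_titles:
            counts[username] += 1
    return counts


if __name__ == "__main__":
    # A few rows from the printed output, standing in for live results.
    sample = [
        (148, "Untitled", "szymun"),
        (152, "Thunder", "dpknyk1993"),
        (139, "Untitled", "WoopDeDoo"),
        (146, "Untitled", "szymun"),
    ]
    untiters = count_untitled(sample, {"Untitled", "Sans titre", "Ohne Titel"})
    print(dict(untiters))  # {'szymun': 2, 'WoopDeDoo': 1}
```

In the real script the second argument would be the full Untitleds list from the question, and the first could be the pool.imap_unordered(...) iterator itself, since the counting happens in the parent process.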
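
As for threads versus processes: most of the time here is probably spent waiting on HTTP responses, which is I/O-bound work, so a thread pool is often enough and avoids the cost of starting extra processes. If the HTML parsing turns out to dominate, processes may still be the better choice. Below is a minimal sketch using concurrent.futures.ThreadPoolExecutor from the standard library. The name scrape_with_threads, the fetch parameter, and max_workers=8 are assumptions for illustration; fetch stands for a function shaped like get_data above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_with_threads(fetch, ids, max_workers=8):
    """Call fetch(id_) for every id in a thread pool.

    fetch is assumed to behave like get_data: it takes one page id and
    returns an (id_, title, username) tuple. Results come back in
    completion order, not in the order of ids.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fetch, id_) for id_ in ids]
        for future in as_completed(futures):
            # result() re-raises any exception that fetch raised.
            results.append(future.result())
    return results


if __name__ == "__main__":
    # Stand-in for get_data so the sketch runs without network access.
    def fake_fetch(id_):
        return id_, "Untitled", "user{}".format(id_ % 2)

    for row in sorted(scrape_with_threads(fake_fetch, range(138, 142))):
        print(row)
```

To try it against the site, pass get_data in place of fake_fetch. Timing both versions on the same id range is the most reliable way to decide between them.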