Optimal way to parallel scraping into a dataframe/csv file


Let's say I have a dataframe full of data, with a column containing different URLs, and I want to scrape a price from the page behind each URL in the dataframe (which is pretty big, more than 15k rows). I want this scraping to run continuously: when it reaches the end of the URLs, it starts over again and again. The last column of the dataframe (Price) would be updated every time a price is scraped.

Here is a visual example of a toy dataframe:

Col 1 ... Col N  URL                             Price
XXXX  ... XXXXX  http://www.some-website1.com/   23,5$
XXXX  ... XXXXX  http://www.some-website2.com/   233,5$
XXXX  ... XXXXX  http://www.some-website3.com/   5$
XXXX  ... XXXXX  http://www.some-website4.com/   2$
...
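
For concreteness, here is a minimal sketch of how such a frame could be built in pandas; the column names and values are just the placeholders from the table above:

import pandas as pd

# toy frame mirroring the layout above -- placeholder values only
df = pd.DataFrame(
    {
        "Col 1": ["XXXX", "XXXX", "XXXX", "XXXX"],
        "URL": [
            "http://www.some-website1.com/",
            "http://www.some-website2.com/",
            "http://www.some-website3.com/",
            "http://www.some-website4.com/",
        ],
        "Price": ["23,5$", "233,5$", "5$", "2$"],
    }
)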

My question is: what is the most efficient way to scrape those URLs using a parallel method (multi-threading, ...), knowing that I can implement the solution with requests/Selenium/bs4 (I can learn pretty much anything)? I would like a theoretical answer more than some lines of code, but if you have a block to share, don't hesitate :)

Thank you

CodePudding user response:

You can use the following example to check the URLs periodically. It feeds df.iterrows() through itertools.cycle to get an endless stream of rows, and that generator is passed to Pool.imap_unordered to fetch the data in parallel:

import requests
import pandas as pd
from time import sleep
from itertools import cycle
from bs4 import BeautifulSoup
from multiprocessing import Pool

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"
}


def get_data(tpl):
    # tpl is one (index, row) pair produced by df.iterrows()
    idx, row = tpl

    r = requests.get(row["url"], headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")

    # throttle each worker so the target server isn't hammered
    sleep(1)

    # the CSS class below is specific to Yahoo Finance's price element
    # and will break whenever the site changes its markup
    return (
        idx,
        soup.find(class_="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)").text,
    )


if __name__ == "__main__":

    # the dataframe holding the URLs to scrape (shown under "df used" below)
    df = pd.DataFrame(
        {
            "url": [
                "https://finance.yahoo.com/quote/AAPL/",
                "https://finance.yahoo.com/quote/INTC/",
            ]
        }
    )

    # cycle() restarts the iteration when the last row is reached,
    # so the scraping runs endlessly
    c = cycle(df.iterrows())

    with Pool(processes=2) as p:
        for i, (idx, new_price) in enumerate(p.imap_unordered(get_data, c)):
            df.loc[idx, "Price"] = new_price

            # print the dataframe only every 10th iteration:
            if i % 10 == 0:
                print()
                print(df)
            else:
                print(".", end="")

Prints:

...

                                     url   Price
0  https://finance.yahoo.com/quote/AAPL/  139.14
1  https://finance.yahoo.com/quote/INTC/   53.47
.........

...and so on

df used:

                                     url
0  https://finance.yahoo.com/quote/AAPL/
1  https://finance.yahoo.com/quote/INTC/
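
Since the work is mostly waiting on network I/O, threads are a lighter-weight alternative to processes here (no pickling of rows between processes). Below is a minimal sketch, assuming the same df, headers and get_data as above, that keeps a bounded number of requests in flight with concurrent.futures.ThreadPoolExecutor:

from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait
from itertools import cycle

# df, headers and get_data are assumed to be defined as in the block above

if __name__ == "__main__":
    c = cycle(df.iterrows())

    with ThreadPoolExecutor(max_workers=8) as ex:
        # start with a window of 8 in-flight requests
        pending = {ex.submit(get_data, next(c)) for _ in range(8)}

        while pending:
            # block until at least one request finishes
            done, pending = wait(pending, return_when=FIRST_COMPLETED)

            for fut in done:
                idx, new_price = fut.result()
                df.loc[idx, "Price"] = new_price

                # refill the window with the next row of the endless cycle
                pending.add(ex.submit(get_data, next(c)))

The bounded window matters: an infinite generator must not be handed to the executor all at once, so each completed request is replaced by exactly one new submission.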