Let's say I have a dataframe full of data, with a column containing different URLs, and I want to scrape a price from the page at each URL of the dataframe (which is pretty big, more than 15k lines). I want this scraping to run continuously: when it reaches the end of the URLs, it starts over again and again. The last column of the dataframe (Price) would be updated every time a price is scraped.
Here is a visual example of a toy dataframe:

Col 1  ...  Col N  URL                            Price
XXXX   ...  XXXXX  http://www.some-website1.com/  23,5$
XXXX   ...  XXXXX  http://www.some-website2.com/  233,5$
XXXX   ...  XXXXX  http://www.some-website3.com/  5$
XXXX   ...  XXXXX  http://www.some-website4.com/  2$
...
My question is: what is the most efficient way to scrape those URLs using a parallel method (multi-threading, ...), knowing that I can implement the solution with requests/Selenium/bs4/... (I can learn pretty much anything)? I would prefer a theoretical answer over a few lines of code, but if you have a block to send, don't hesitate :)
Thank you
CodePudding user response:
You can use the following example to check the URLs periodically. It uses itertools.cycle with df.iterrows(); the resulting endless generator is then fed to Pool.imap_unordered to get the data:
import requests
from time import sleep
from itertools import cycle
from bs4 import BeautifulSoup
from multiprocessing import Pool

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"
}


def get_data(tpl):
    idx, row = tpl
    r = requests.get(row["url"], headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    sleep(1)  # throttle, so the site isn't hammered
    return (
        idx,
        soup.find(class_="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)").text,
    )


if __name__ == "__main__":
    # endless iterator over the rows: restarts at row 0 after the last row
    c = cycle(df.iterrows())

    with Pool(processes=2) as p:
        for i, (idx, new_price) in enumerate(p.imap_unordered(get_data, c)):
            df.loc[idx, "Price"] = new_price

            # print the dataframe only every 10th iteration:
            if i % 10 == 0:
                print()
                print(df)
            else:
                print(".", end="")
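Since requests spends most of its time waiting on I/O, a thread pool is usually at least as good a fit as processes here: threads share df directly and avoid pickling each row. Below is a minimal sketch of the same cycle + imap_unordered pattern using multiprocessing.pool.ThreadPool; the get_data body is a hypothetical stand-in (no network) so the sketch runs as-is — replace it with the requests/BeautifulSoup logic from above.

```python
from itertools import cycle
from multiprocessing.pool import ThreadPool

import pandas as pd


def get_data(tpl):
    # Hypothetical stand-in for the real scraper: returns a fake price
    # string. Replace the body with the requests + BeautifulSoup logic.
    idx, row = tpl
    return idx, f"price-for-{row['url']}"


df = pd.DataFrame({
    "url": ["http://www.some-website1.com/", "http://www.some-website2.com/"],
    "Price": None,
})

c = cycle(df.iterrows())
with ThreadPool(processes=2) as p:
    for i, (idx, new_price) in enumerate(p.imap_unordered(get_data, c)):
        df.loc[idx, "Price"] = new_price
        if i >= 5:  # stop after a few rounds; the real loop runs forever
            break

print(df)
```

Leaving the with block terminates the pool, which is what stops the otherwise endless task feed from cycle.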
Prints:

...
                                     url   Price
0  https://finance.yahoo.com/quote/AAPL/  139.14
1  https://finance.yahoo.com/quote/INTC/   53.47
.........

...and so on
df used:

                                     url
0  https://finance.yahoo.com/quote/AAPL/
1  https://finance.yahoo.com/quote/INTC/
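For completeness, the two-row df used above can be built like this (adding the Price column up front is optional — the df.loc[idx, "Price"] assignment in the loop would create it anyway):

```python
import pandas as pd

# Recreate the sample dataframe from the output above.
df = pd.DataFrame({
    "url": [
        "https://finance.yahoo.com/quote/AAPL/",
        "https://finance.yahoo.com/quote/INTC/",
    ]
})
df["Price"] = None  # filled in by the scraping loop
print(df)
```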