How to iterate through a list of URLs and save the results to a CSV file?


import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5'}

URL = "https://www.amazon.com/TRESemmé-Botanique-Shampoo-Nourish-Replenish/dp/B0199WNJE8/ref=sxin_14_pa_sp_search_thematic_sspa?content-id=amzn1.sym.a15c61b7-4b93-404d-bb70-88600dfb718d:amzn1.sym.a15c61b7-4b93-404d-bb70-88600dfb718d&crid=2HG5WSUDCJBMZ&cv_ct_cx=hair+tresemme&keywords=hair+tresemme&pd_rd_i=B0199WNJE8&pd_rd_r=28d72361-7f35-4b1a-be43-98e7103da70c&pd_rd_w=6UL4P&pd_rd_wg=JtUqB&pf_rd_p=a15c61b7-4b93-404d-bb70-88600dfb718d&pf_rd_r=DFPZNAG391M5JS55R6HP&qid=1660432925&sprefix=hair+tresemme,aps,116&sr=1-3-a73d1c8c-2fd2-4f19-aa41-2df022bcb241-spons&smid=A3DEFW12560V8M&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExQlM3VFpGRVM5Tk8wJmVuY3J5cHRlZElkPUEwNjE5MjQwM01JV0FNN1pOMlRHSSZlbmNyeXB0ZWRBZElkPUEwNTA1MDQyMlQ5RjhRQUxIWEdaUiZ3aWRnZXROYW1lPXNwX3NlYXJjaF90aGVtYXRpYyZhY3Rpb249Y2xpY2tSZWRpcmVjdCZkb05vdExvZ0NsaWNrPXRydWU&th=1"
webpage = requests.get(URL, headers=headers)
soup = BeautifulSoup(webpage.content, 'html.parser')

# Best Sellers Rank text, e.g. "#1,234 in Beauty & Personal Care (...)"
best_sellers = soup.select_one('#detailBulletsWrapper_feature_div span:-soup-contains("Best Seller")').contents[2].get_text()

rank = best_sellers.split()[0]
Category = ' '.join(best_sellers.split()[2:6])


type(rank)      # str
type(Category)  # str

import string

# strip punctuation such as '#' and ',' out of the rank
for char in string.punctuation:
    rank = rank.replace(char, '')
print(rank)
print(Category)

I have other URLs similar to this one and I want to loop through them. Here are the links. How can I loop through them and save the results to a CSV file? Thank you very much in advance!

URL = ['https://www.amazon.com/Dove-Intensive-Concentrate-Technology-Protects/dp/B0B1VVXTKL',
       'https://www.amazon.com/Dove-Intensive-Concentrate-Conditioner-Technology/dp/B0B1VXFLQ2']

CodePudding user response:

You could use a for loop to iterate over the list:

for url in URL:
    webpage = requests.get(url, headers=headers)
    soup = BeautifulSoup(webpage.content, 'html.parser')

Note: Amazon does not want to be scraped, so it is only a question of time before they block you. You may want to add some delay between requests, use a rotating proxy, ...
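For example, a minimal sketch of adding a randomized pause between requests, reusing the URL list and headers from above (the 2-5 second range is an arbitrary choice):

import random
import time

for url in URL:
    webpage = requests.get(url, headers=headers)
    soup = BeautifulSoup(webpage.content, 'html.parser')
    # wait 2-5 seconds between requests to look less like a bot
    time.sleep(random.uniform(2, 5))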

Example

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5'}
URL = ['https://www.amazon.com/Dove-Intensive-Concentrate-Technology-Protects/dp/B0B1VVXTKL',
       'https://www.amazon.com/Dove-Intensive-Concentrate-Conditioner-Technology/dp/B0B1VXFLQ2']
data = []
for url in URL:
    webpage = requests.get(url, headers=headers)
    soup = BeautifulSoup(webpage.content, 'html.parser')
    # one dict per product; each key becomes a column in the CSV
    data.append({
        'url': url,
        'rank': soup.select_one('#detailBulletsWrapper_feature_div span:-soup-contains("Best Seller")').contents[2].get_text().split()[0],
        'other things': '...'
    })

pd.DataFrame(data).to_csv('myfile.csv', index=False)
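
Note that select_one() returns None when the element is missing (for example if Amazon serves a captcha page or changes the layout), and the chained .contents[2] access then raises an AttributeError. A defensive sketch of the same extraction, using the same selector and a hypothetical helper name:

def extract_rank(soup):
    # return the Best Sellers rank string, or None if the element is missing
    span = soup.select_one('#detailBulletsWrapper_feature_div span:-soup-contains("Best Seller")')
    if span is None or len(span.contents) < 3:
        return None
    return span.contents[2].get_text().split()[0]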
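If you prefer to avoid the pandas dependency, an equivalent sketch using the standard-library csv module writes the same file:

import csv

# each dict in data becomes one row; the keys must match the fieldnames
with open('myfile.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['url', 'rank', 'other things'])
    writer.writeheader()
    writer.writerows(data)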